mariadb/strings/CHARSET_INFO.txt

2004-10-18 15:25:28 +05:00
CHARSET_INFO
============
A structure containing data for charset+collation pair implementation.
Virtual functions that use this data are collected into separate
structures, MY_CHARSET_HANDLER and MY_COLLATION_HANDLER.

typedef struct charset_info_st
{
  uint      number;
  uint      primary_number;
  uint      binary_number;
  uint      state;
  const char *csname;
  const char *name;
  const char *comment;
  uchar    *ctype;
  uchar    *to_lower;
  uchar    *to_upper;
  uchar    *sort_order;
  uint16   *tab_to_uni;
  MY_UNI_IDX *tab_from_uni;
  uchar    state_map[256];
  uchar    ident_map[256];
  uint     strxfrm_multiply;
  uint     mbminlen;
  uint     mbmaxlen;
  uint16   max_sort_char; /* For LIKE optimization */
  MY_CHARSET_HANDLER *cset;
  MY_COLLATION_HANDLER *coll;
} CHARSET_INFO;

CHARSET_INFO fields description:
===============================
Numbers (identifiers)
---------------------
number - an ID uniquely identifying this charset+collation pair.
primary_number - ID of a charset+collation pair, which consists
of the same character set and the default collation of this
character set. Not really used now. Intended to optimize some
parts of the code where we need to find the default collation
using its non-default counterpart for the given character set.
binary_number - ID of a charset+collation pair, which consists
of the same character set and the binary collation of this
character set. Not really used now.
Names
-----
csname - name of the character set for this charset+collation pair.
name - name of the collation for this charset+collation pair.
comment - a text comment, displayed in "Description" column of
SHOW CHARACTER SET output.
Conversion tables
-----------------
ctype - pointer to array[257] of "type of characters"
bit mask for each character, e.g., whether a
character is a digit, letter, separator, etc.
Monty 2004-10-21:
If you look at the macros, we use ctype[(char)+1].
ctype[0] is traditionally in most ctype libraries
reserved for EOF (-1). The idea is that you can use
the result from fgetc() directly with ctype[]. As
we have to be compatible with external ctype[] versions,
it's better to do it the same way as they do...
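
The +1 offset can be illustrated with a minimal sketch. All names below
(sk_ctype, SK_DIGIT, SK_ALPHA and the helper functions) are hypothetical
and only classify ASCII; the real masks and macros live in m_ctype.h.
The sketch also assumes EOF == -1, as on common platforms.

```c
#include <stdio.h>   /* EOF */

/* Hypothetical flag bits, for illustration only. */
#define SK_DIGIT 0x01
#define SK_ALPHA 0x02

/* 257 slots: index 0 is for EOF (-1), indexes 1..256 are for bytes 0..255. */
static unsigned char sk_ctype[257];

/* Fill the table, classifying the ASCII digits and letters only. */
static void sk_ctype_init(void)
{
  int c;
  sk_ctype[0] = 0;                      /* EOF belongs to no class */
  for (c = 0; c < 256; c++)
  {
    unsigned char flags = 0;
    if (c >= '0' && c <= '9')
      flags |= SK_DIGIT;
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
      flags |= SK_ALPHA;
    sk_ctype[c + 1] = flags;
  }
}

/* c must be an unsigned char value or EOF, i.e. what fgetc() returns. */
static int sk_isdigit(int c) { return sk_ctype[c + 1] & SK_DIGIT; }
static int sk_isalpha(int c) { return sk_ctype[c + 1] & SK_ALPHA; }
```

With this layout the result of fgetc() can be passed to sk_isdigit()
without first checking for EOF.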
to_lower - pointer to array[256] used in LCASE()
to_upper - pointer to array[256] used in UCASE()
sort_order - pointer to array[256] used for strings comparison
In all Asian charsets these arrays are set up as follows:
- All bytes in the range 0x80..0xFF are marked as letters in the
ctype array.
- The to_lower and to_upper arrays map only ASCII letters.
UPPER() and LOWER() don't really work for multi-byte characters.
Most of the characters in Asian character sets are ideograms
anyway and they don't have case mapping. However, there are
still some characters from European alphabets.
For example:
_ujis 0x8FAAF2 - LATIN CAPITAL LETTER Y WITH ACUTE
_ujis 0x8FABF2 - LATIN SMALL LETTER Y WITH ACUTE
But they don't map to each other with UPPER and LOWER operations.
- The sort_order array is filled case insensitively for the
ASCII range 0x00..0x7F, and in "binary" fashion for the multi-byte
range 0x80..0xFF for these collations:
cp932_japanese_ci,
euckr_korean_ci,
eucjpms_japanese_ci,
gb2312_chinese_ci,
sjis_japanese_ci,
ujis_japanese_ci.
So multi-byte characters are sorted just according to their codes.
- Two collations are still case insensitive for the ASCII characters,
but have special sorting order for multi-byte characters
(something more complex than just according to codes):
big5_chinese_ci
gbk_chinese_ci
So handlers for these collations use only the 0x00..0x7F part
of their sort_order arrays, and apply the special functions
for multi-byte characters.
Unicode character sets have full support for UPPER/LOWER mapping,
sorting order, and character type detection.
"utf8_general_ci" still has the "old-fashioned" arrays
like to_upper, to_lower, sort_order and ctype, but they are
not really used (maybe only in some rare legacy functions).
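
The behaviour of a sort_order array that is case insensitive for ASCII and
"binary" for 0x80..0xFF can be sketched as follows. This is an
illustration under simple assumptions, not the server's comparison code;
sk_sort_order, sk_sort_order_init and sk_compare are hypothetical names.

```c
#include <stddef.h>

/* One weight per byte value. */
static unsigned char sk_sort_order[256];

/* ASCII letters get the weight of their uppercase form;
   every other byte, including 0x80..0xFF, keeps its own code. */
static void sk_sort_order_init(void)
{
  int c;
  for (c = 0; c < 256; c++)
    sk_sort_order[c] =
      (unsigned char) ((c >= 'a' && c <= 'z') ? c - 'a' + 'A' : c);
}

/* Compare two byte strings by weight; returns <0, 0 or >0. */
static int sk_compare(const unsigned char *a, size_t alen,
                      const unsigned char *b, size_t blen)
{
  size_t n = alen < blen ? alen : blen;
  size_t i;
  for (i = 0; i < n; i++)
  {
    int d = (int) sk_sort_order[a[i]] - (int) sk_sort_order[b[i]];
    if (d)
      return d;
  }
  return (alen > blen) - (alen < blen);
}
```

Because bytes of multi-byte characters map to themselves, such characters
end up ordered by their codes, as described above.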
Unicode conversion data
-----------------------
For 8-bit character sets:
tab_to_uni : array[256] of charset->Unicode translation
tab_from_uni: a structure for Unicode->charset translation
Non-8-bit charsets have their own per-charset structures,
hidden in the corresponding ctype-xxx.c file, and don't use the
tab_to_uni and tab_from_uni tables.
Parser maps
-----------
state_map[]
ident_map[]
These maps are used to quickly identify whether a character is an
identifier part, a digit, a special character, or a part of another
SQL language lexical item.
These maps could probably be combined with the ctype array in the future.
For some reason, however, these two arrays are used in the parser,
while a separate ctype[] array is used in other parts of the
code, such as fulltext.
Miscellaneous fields
--------------------
strxfrm_multiply - how many times longer a sort key (that is, a string
that can be passed into memcmp() for comparison)
can be than the original string.
Usually it is 1. For some complex
collations it can be bigger. For example,
in latin1_german2_ci, a sort key is up to
two times longer than the original string,
because the letter 'A' with two dots above is
substituted with 'AE'.
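
A minimal sketch of why the multiplier matters when sizing a sort key
buffer. It is a toy key builder for latin1 input that only expands
A/a with diaeresis and uppercases ASCII; the names SK_STRXFRM_MULTIPLY
and sk_make_sort_key are hypothetical, and the real latin1_german2_ci
rules cover more letters.

```c
#include <stddef.h>

/* Worst case in this toy collation: one source byte becomes two key bytes. */
#define SK_STRXFRM_MULTIPLY 2

/*
  Build a memcmp()-comparable key from a latin1 string.
  dst must have room for srclen * SK_STRXFRM_MULTIPLY bytes.
  Returns the number of key bytes written.
*/
static size_t sk_make_sort_key(unsigned char *dst,
                               const unsigned char *src, size_t srclen)
{
  size_t i, n = 0;
  for (i = 0; i < srclen; i++)
  {
    unsigned char c = src[i];
    if (c == 0xC4 || c == 0xE4)         /* latin1 A/a with diaeresis */
    {
      dst[n++] = 'A';
      dst[n++] = 'E';
    }
    else if (c >= 'a' && c <= 'z')
      dst[n++] = (unsigned char) (c - 'a' + 'A');
    else
      dst[n++] = c;
  }
  return n;
}
```

A caller would allocate srclen * SK_STRXFRM_MULTIPLY bytes for the key,
which is what the strxfrm_multiply field lets the server compute.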
mbminlen - minimum multi-byte sequence length.
Currently always 1, except for ucs2, where it is 2.
mbmaxlen - maximum multi-byte sequence length.
1 for 8-bit charsets. Can also be 2 or 3.
max_sort_char - for LIKE range
in case of 8-bit character sets - native code
of maximum character (max_str pad byte);
in case of UTF8 and UCS2 - Unicode code of the maximum
possible character (usually U+FFFF). This code is
converted to multi-byte representation (usually 0xEFBFBF)
and then used as a pad sequence for max_str.
in case of other multi-byte character sets -
max_str pad byte (usually 0xFF).
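
The UTF8 case can be checked with a short sketch: encoding U+FFFF yields
the bytes 0xEF 0xBF 0xBF, which are then repeated to pad max_str. The
function names are hypothetical, and the encoder handles only code points
up to U+FFFF.

```c
#include <stddef.h>
#include <string.h>

/* Encode a code point <= U+FFFF as UTF-8; returns bytes written (1..3). */
static size_t sk_utf8_encode(unsigned long wc, unsigned char *out)
{
  if (wc < 0x80)
  {
    out[0] = (unsigned char) wc;
    return 1;
  }
  if (wc < 0x800)
  {
    out[0] = (unsigned char) (0xC0 | (wc >> 6));
    out[1] = (unsigned char) (0x80 | (wc & 0x3F));
    return 2;
  }
  out[0] = (unsigned char) (0xE0 | (wc >> 12));
  out[1] = (unsigned char) (0x80 | ((wc >> 6) & 0x3F));
  out[2] = (unsigned char) (0x80 | (wc & 0x3F));
  return 3;
}

/* Fill buf with as many whole copies of the encoded pad character
   as fit; returns the number of bytes written. */
static size_t sk_pad_max_str(unsigned char *buf, size_t buflen,
                             unsigned long max_sort_char)
{
  unsigned char seq[3];
  size_t seqlen = sk_utf8_encode(max_sort_char, seq);
  size_t n = 0;
  while (buflen - n >= seqlen)
  {
    memcpy(buf + n, seq, seqlen);
    n += seqlen;
  }
  return n;
}
```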
MY_CHARSET_HANDLER
==================
MY_CHARSET_HANDLER is a collection of character-set
related routines, defined in m_ctype.h. It has the
following set of functions:
Multi-byte routines
-------------------
ismbchar() - detects whether the given string is a multi-byte sequence
mbcharlen() - returns length of multi-byte sequence starting with
the given character
numchars() - returns number of characters in the given string, e.g.
in SQL function CHAR_LENGTH().
charpos() - calculates the offset of the given position in the string.
Used in SQL functions LEFT(), RIGHT(), SUBSTRING(),
INSERT()
well_formed_length()
- finds the length of the well-formed multi-byte beginning of a string.
Used in INSERTs to cut off the beginning of the given string
which is
a) "well formed" according to the given character set,
b) able to fit into the given data type.
Terminates the string at a good position, taking into account
multi-byte character boundaries.
lengthsp() - returns the length of the given string without trailing spaces.
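
The ideas behind several of these routines can be sketched for UTF-8
limited to 1..3 byte sequences. This is a simplified illustration, not
the handler code: the sk_ names and signatures are hypothetical, and the
validity check looks only at lead and continuation byte patterns (it does
not reject every overlong form or surrogate).

```c
#include <stddef.h>

/* Length of the sequence introduced by lead byte c,
   or 0 if c cannot start a 1..3 byte sequence. */
static size_t sk_mbcharlen(unsigned char c)
{
  if (c < 0x80)
    return 1;
  if (c >= 0xC2 && c <= 0xDF)
    return 2;
  if (c >= 0xE0 && c <= 0xEF)
    return 3;
  return 0;
}

/* Number of characters in s; a stray byte counts as one character. */
static size_t sk_numchars(const unsigned char *s, size_t len)
{
  size_t pos = 0, count = 0;
  while (pos < len)
  {
    size_t n = sk_mbcharlen(s[pos]);
    if (n == 0 || n > len - pos)
      n = 1;
    pos += n;
    count++;
  }
  return count;
}

/* Byte length of the longest well-formed prefix of s
   that holds at most max_chars characters. */
static size_t sk_well_formed_length(const unsigned char *s, size_t len,
                                    size_t max_chars)
{
  size_t pos = 0;
  while (pos < len && max_chars > 0)
  {
    size_t n = sk_mbcharlen(s[pos]), i;
    if (n == 0 || n > len - pos)
      break;                            /* bad lead byte or truncated tail */
    for (i = 1; i < n; i++)
      if ((s[pos + i] & 0xC0) != 0x80)
        return pos;                     /* bad continuation byte */
    pos += n;
    max_chars--;
  }
  return pos;
}

/* Length of s without trailing spaces (8-bit, ASCII-compatible input). */
static size_t sk_lengthsp(const char *s, size_t len)
{
  while (len > 0 && s[len - 1] == ' ')
    len--;
  return len;
}
```

Cutting a value at sk_well_formed_length() bytes never splits a character,
which is the property an INSERT into a length-limited column needs.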
Unicode conversion routines
---------------------------
mb_wc - converts the leading multi-byte sequence of the given string into its Unicode code.
wc_mb - converts the given Unicode code into a multi-byte sequence.
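
The two conversion directions can be sketched for ucs2, where every
character is two bytes, high byte first. The function names, signatures
and the "return 0 on failure" convention are assumptions made for this
sketch; the real handlers use their own return codes.

```c
#include <stddef.h>

/* Decode the 2-byte sequence at s (which ends at end) into *wc.
   Returns the number of bytes consumed, or 0 if the input is too short. */
static int sk_ucs2_mb_wc(unsigned long *wc,
                         const unsigned char *s, const unsigned char *end)
{
  if (end - s < 2)
    return 0;
  *wc = ((unsigned long) s[0] << 8) | s[1];
  return 2;
}

/* Encode wc at s (buffer ends at end).
   Returns the number of bytes written, or 0 if wc is not representable
   in ucs2 or the buffer is too short. */
static int sk_ucs2_wc_mb(unsigned long wc,
                         unsigned char *s, unsigned char *end)
{
  if (wc > 0xFFFF || end - s < 2)
    return 0;
  s[0] = (unsigned char) (wc >> 8);
  s[1] = (unsigned char) (wc & 0xFF);
  return 2;
}
```

Converting between two character sets then amounts to decoding with one
charset's routine and encoding with the other's, one character at a time.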
Case and sort conversion
------------------------
caseup_str - converts the given 0-terminated string to uppercase
casedn_str - converts the given 0-terminated string to lowercase
caseup - converts the given string to uppercase using length
casedn - converts the given string to lowercase using length
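
The difference between the length-based and the 0-terminated variants can
be sketched for an 8-bit charset driven by a to_upper table. The names are
hypothetical and the table below covers ASCII letters only.

```c
#include <stddef.h>

/* One uppercase mapping per byte value. */
static unsigned char sk_to_upper[256];

/* ASCII-only mapping; all other bytes map to themselves. */
static void sk_to_upper_init(void)
{
  int c;
  for (c = 0; c < 256; c++)
    sk_to_upper[c] =
      (unsigned char) ((c >= 'a' && c <= 'z') ? c - 'a' + 'A' : c);
}

/* Length-based variant: converts exactly len bytes in place. */
static void sk_caseup(char *s, size_t len)
{
  size_t i;
  for (i = 0; i < len; i++)
    s[i] = (char) sk_to_upper[(unsigned char) s[i]];
}

/* 0-terminated variant: converts in place up to the terminating 0. */
static void sk_caseup_str(char *s)
{
  for (; *s; s++)
    *s = (char) sk_to_upper[(unsigned char) *s];
}
```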
Number-to-string conversion routines
------------------------------------
snprintf()
long10_to_str()
longlong10_to_str()
The names are pretty self-describing.
String padding routines
-----------------------
fill() - writes the given Unicode value into the given string
with the given length. Used to pad the string, usually
with space character, according to the given charset.
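
Padding depends on the charset because the pad character may occupy more
than one byte. A minimal sketch for ucs2, where a space is the two bytes
0x00 0x20; sk_fill_ucs2 is a hypothetical name, and leaving an odd
trailing byte untouched is a choice made for this sketch only.

```c
#include <stddef.h>

/* Fill s[0..len) with whole 2-byte copies of the character wc,
   high byte first. Returns the number of bytes written. */
static size_t sk_fill_ucs2(unsigned char *s, size_t len, unsigned int wc)
{
  size_t i;
  for (i = 0; len - i >= 2; i += 2)
  {
    s[i] = (unsigned char) (wc >> 8);
    s[i + 1] = (unsigned char) (wc & 0xFF);
  }
  return i;
}
```

For an 8-bit charset the same operation reduces to a plain memset() with
the pad byte.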
String-to-number conversion routines
------------------------------------
strntol()
strntoul()
strntoll()
strntoull()
strntod()
These functions are almost the same as their STDLIB counterparts,
but also:
- accept length instead of 0-terminator
- are character set dependent
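
A minimal sketch of the "length instead of 0-terminator" idea for an
ASCII-compatible 8-bit charset. sk_strntol10 is a hypothetical name; it
parses decimal only and, unlike a production routine, does not check for
overflow.

```c
#include <stddef.h>

/*
  Parse a decimal integer from s[0..len).
  Leading spaces and tabs are skipped and an optional sign is accepted.
  If end is not NULL, *end is set to the first unparsed byte.
*/
static long sk_strntol10(const char *s, size_t len, const char **end)
{
  size_t i = 0;
  int negative = 0;
  long value = 0;

  while (i < len && (s[i] == ' ' || s[i] == '\t'))
    i++;
  if (i < len && (s[i] == '-' || s[i] == '+'))
  {
    negative = (s[i] == '-');
    i++;
  }
  for (; i < len && s[i] >= '0' && s[i] <= '9'; i++)
    value = value * 10 + (s[i] - '0');
  if (end)
    *end = s + i;
  return negative ? -value : value;
}
```

Because the routine never reads past s + len, it can be applied directly
to a slice of a larger buffer.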
Simple scanner routines
-----------------------
scan() - to skip leading spaces in the given string.
Used when a string value is inserted into a numeric field.
MY_COLLATION_HANDLER
====================
strnncoll() - compares two strings according to the given collation
strnncollsp() - like the above but ignores trailing spaces
strnxfrm() - makes a sort key suitable for memcmp() corresponding
to the given string
like_range() - creates a LIKE range, for optimizer
wildcmp() - wildcard comparison, for LIKE
strcasecmp() - 0-terminated string comparison
instr() - finds the first substring appearance in the string
hash_sort() - calculates hash value taking into account
the collation rules, e.g. case-insensitivity,
accent sensitivity, etc.
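
The relationship between comparison and hashing can be sketched for a toy
collation that is case insensitive for ASCII and ignores trailing spaces.
All names are hypothetical, and the hash is a plain 32-bit FNV-1a, chosen
for brevity; it is not the function the server uses. The point is that
strings which compare as equal must produce the same hash value.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold ASCII lowercase letters to uppercase. */
static unsigned char sk_fold(unsigned char c)
{
  return (unsigned char) ((c >= 'a' && c <= 'z') ? c - 'a' + 'A' : c);
}

/* Compare as if the shorter string were padded with spaces. */
static int sk_strnncollsp(const unsigned char *a, size_t alen,
                          const unsigned char *b, size_t blen)
{
  size_t n = alen > blen ? alen : blen;
  size_t i;
  for (i = 0; i < n; i++)
  {
    int ca = sk_fold(i < alen ? a[i] : (unsigned char) ' ');
    int cb = sk_fold(i < blen ? b[i] : (unsigned char) ' ');
    if (ca != cb)
      return ca - cb;
  }
  return 0;
}

/* FNV-1a over the folded bytes, with trailing spaces dropped first. */
static uint32_t sk_hash_sort(const unsigned char *s, size_t len)
{
  uint32_t h = 2166136261u;
  size_t i;
  while (len > 0 && s[len - 1] == ' ')
    len--;
  for (i = 0; i < len; i++)
  {
    h ^= sk_fold(s[i]);
    h *= 16777619u;
  }
  return h;
}
```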