This patch adds a way to override default collations
(or "character set collations") for desired character sets.
The SQL standard says:
> Each collation known in an SQL-environment is applicable to one
> or more character sets, and for each character set, one or more
> collations are applicable to it, one of which is associated with
> it as its character set collation.
In MariaDB, character set collations has been hard-coded so far,
e.g. utf8mb4_general_ci has been a hard-coded character set collation
for utf8mb4.
This patch allows to override (globally per server, or per session)
character set collations, so for example, uca1400_ai_ci can be set as a
character set collation for Unicode character sets
(instead of compiled xxx_general_ci).
The array of overridden character set collations is stored in a new
(session and global) system variable @@character_set_collations and
can be set as a comma separated list of charset=collation pairs, e.g.:
SET @@character_set_collations='utf8mb3=uca1400_ai_ci,utf8mb4=uca1400_ai_ci';
The variable is empty by default, which mean use the hard-coded
character set collations (e.g. utf8mb4_general_ci for utf8mb4).
The variable can also be set globally by passing to the server startup command
line, and/or in my.cnf.
Implementation:
Implementation is made according to json schema validation draft 2020
JSON schema basically has same structure as that of json object, consisting
of key-value pairs. So it can be parsed in the same manner as
any json object.
However, none of the keywords are mandatory, so making guess about the
json value type based only on the keywords would be incorrect.
Hence we need separate objects denoting each keyword.
So during create_object_and_handle_keyword() we create appropriate objects
based on the keywords and validate each of them individually on the json
document by calling respective validate() function if the type matches.
If any of them fails, return false, else return true.
Type_handler::partition_field_append_value() erroneously
passed the address of my_collation_contextually_typed_binary
to conversion functions copy_and_convert() and my_convert().
This happened because generate_partition_syntax_for_frm()
was called from mysql_create_frm_image() in the stage when
the fields in List<Create_field> can still contain unresolved
contextual collations, like "binary" in the reported crash scenario:
ALTER TABLE t CHANGE COLUMN a a CHAR BINARY;
Fix:
1. Splitting mysql_prepare_create_table() into two parts:
- mysql_prepare_create_table_stage1() interates through
List<Create_field> and calls Create_field::prepare_stage1(),
which performs basic attribute initialization, including
context collation resolution.
- mysql_prepare_create_table_finalize() - the rest of the
old mysql_prepare_create_table() code.
2. Changing mysql_create_frm_image():
It now calls:
- mysql_prepare_create_table_stage1() in the very
beginning, before the partition related code.
- mysql_prepare_create_table_finalize() in the end,
instead of the old mysql_prepare_create_table() call
3. Adding mysql_prepare_create_table() as a wrapper
for two calls:
mysql_prepare_create_table_stage1() ||
mysql_prepare_create_table_finalize()
so the code stays unchanged in the other places
where mysql_prepare_create_table() was used.
4. Changing prototype for Type_handler::Column_definition_prepare_stage1()
Removing arguments:
- handler *file
- ulonglong table_flags
Adding a new argument instead:
- column_definition_type_t type
This allows to call Column_definition_prepare_stage1() and
therefore to call mysql_prepare_create_table_stage1()
before instantiation of a handler.
This simplifies the code, because in case of a partitioned table,
mysql_create_frm_image() creates a handler of the underlying
partition first, the frees it and created a ha_partition
instance instead.
mysql_prepare_create_table() before the fix was called with the final
(ha_partition) handler.
5. Moving parts of Column_definition_prepare_stage1() which
need a pointer to handler and table_flags to
Column_definition_prepare_stage2().
This makes it easier to compare different costs and also allows
the optimizer to optimizer different storage engines more reliably.
- Added tests/check_costs.pl, a tool to verify optimizer cost calculations.
- Most engine costs has been found with this program. All steps to
calculate the new costs are documented in Docs/optimizer_costs.txt
- User optimizer_cost variables are given in microseconds (as individual
costs can be very small). Internally they are stored in ms.
- Changed DISK_READ_COST (was DISK_SEEK_BASE_COST) from a hard disk cost
(9 ms) to common SSD cost (400MB/sec).
- Removed cost calculations for hard disks (rotation etc).
- Changed the following handler functions to return IO_AND_CPU_COST.
This makes it easy to apply different cost modifiers in ha_..time()
functions for io and cpu costs.
- scan_time()
- rnd_pos_time() & rnd_pos_call_time()
- keyread_time()
- Enhanched keyread_time() to calculate the full cost of reading of a set
of keys with a given number of ranges and optional number of blocks that
need to be accessed.
- Removed read_time() as keyread_time() + rnd_pos_time() can do the same
thing and more.
- Tuned cost for: heap, myisam, Aria, InnoDB, archive and MyRocks.
Used heap table costs for json_table. The rest are using default engine
costs.
- Added the following new optimizer variables:
- optimizer_disk_read_ratio
- optimizer_disk_read_cost
- optimizer_key_lookup_cost
- optimizer_row_lookup_cost
- optimizer_row_next_find_cost
- optimizer_scan_cost
- Moved all engine specific cost to OPTIMIZER_COSTS structure.
- Changed costs to use 'records_out' instead of 'records_read' when
recalculating costs.
- Split optimizer_costs.h to optimizer_costs.h and optimizer_defaults.h.
This allows one to change costs without having to compile a lot of
files.
- Updated costs for filter lookup.
- Use a better cost estimate in best_extension_by_limited_search()
for the sorting cost.
- Fixed previous issues with 'filtered' explain column as we are now
using 'records_out' (min rows seen for table) to calculate filtering.
This greatly simplifies the filtering code in
JOIN_TAB::save_explain_data().
This change caused a lot of queries to be optimized differently than
before, which exposed different issues in the optimizer that needs to
be fixed. These fixes are in the following commits. To not have to
change the same test case over and over again, the changes in the test
cases are done in a single commit after all the critical change sets
are done.
InnoDB changes:
- Updated InnoDB to not divide big range cost with 2.
- Added cost for InnoDB (innobase_update_optimizer_costs()).
- Don't mark clustered primary key with HA_KEYREAD_ONLY. This will
prevent that the optimizer is trying to use index-only scans on
the clustered key.
- Disabled ha_innobase::scan_time() and ha_innobase::read_time() and
ha_innobase::rnd_pos_time() as the default engine cost functions now
works good for InnoDB.
Other things:
- Added --show-query-costs (\Q) option to mysql.cc to show the query
cost after each query (good when working with query costs).
- Extended my_getopt with GET_ADJUSTED_VALUE which allows one to adjust
the value that user is given. This is used to change cost from
microseconds (user input) to milliseconds (what the server is
internally using).
- Added include/my_tracker.h ; Useful include file to quickly test
costs of a function.
- Use handler::set_table() in all places instead of 'table= arg'.
- Added SHOW_OPTIMIZER_COSTS to sys variables. These are input and
shown in microseconds for the user but stored as milliseconds.
This is to make the numbers easier to read for the user (less
pre-zeros). Implemented in 'Sys_var_optimizer_cost' class.
- In test_quick_select() do not use index scans if 'no_keyread' is set
for the table. This is what we do in other places of the server.
- Added THD parameter to Unique::get_use_cost() and
check_index_intersect_extension() and similar functions to be able
to provide costs to called functions.
- Changed 'records' to 'rows' in optimizer_trace.
- Write more information to optimizer_trace.
- Added INDEX_BLOCK_FILL_FACTOR_MUL (4) and INDEX_BLOCK_FILL_FACTOR_DIV (3)
to calculate usage space of keys in b-trees. (Before we used numeric
constants).
- Removed code that assumed that b-trees has similar costs as binary
trees. Replaced with engine calls that returns the cost.
- Added Bitmap::find_first_bit()
- Added timings to join_cache for ANALYZE table (patch by Sergei Petrunia).
- Added records_init and records_after_filter to POSITION to remember
more of what best_access_patch() calculates.
- table_after_join_selectivity() changed to recalculate 'records_out'
based on the new fields from best_access_patch()
Bug fixes:
- Some queries did not update last_query_cost (was 0). Fixed by moving
setting thd->...last_query_cost in JOIN::optimize().
- Write '0' as number of rows for const tables with a matching row.
Some internals:
- Engine cost are stored in OPTIMIZER_COSTS structure. When a
handlerton is created, we also created a new cost variable for the
handlerton. We also create a new variable if the user changes a
optimizer cost for a not yet loaded handlerton either with command
line arguments or with SET
@@global.engine.optimizer_cost_variable=xx.
- There are 3 global OPTIMIZER_COSTS variables:
default_optimizer_costs The default costs + changes from the
command line without an engine specifier.
heap_optimizer_costs Heap table costs, used for temporary tables
tmp_table_optimizer_costs The cost for the default on disk internal
temporary table (MyISAM or Aria)
- The engine cost for a table is stored in table_share. To speed up
accesses the handler has a pointer to this. The cost is copied
to the table on first access. If one wants to change the cost one
must first update the global engine cost and then do a FLUSH TABLES.
This was done to be able to access the costs for an open table
without any locks.
- When a handlerton is created, the cost are updated the following way:
See sql/keycaches.cc for details:
- Use 'default_optimizer_costs' as a base
- Call hton->update_optimizer_costs() to override with the engines
default costs.
- Override the costs that the user has specified for the engine.
- One handler open, copy the engine cost from handlerton to TABLE_SHARE.
- Call handler::update_optimizer_costs() to allow the engine to update
cost for this particular table.
- There are two costs stored in THD. These are copied to the handler
when the table is used in a query:
- optimizer_where_cost
- optimizer_scan_setup_cost
- Simply code in best_access_path() by storing all cost result in a
structure. (Idea/Suggestion by Igor)
SELECT FROM JSON_TABLE
Analysis: When fix_fields_if_needed() is called, it doesnt check if operands
are valid because check_cols() is not called. So it doesn't error out and
eventually crashes.
Fix: Use fix_fields_if_needed_for_scalar() instead of
fix_fields_if_needed(). It filters the scalar and returns the error if
it occurs.
This commit is a fixup for MDEV-28762
Analysis: Some recursive json functions dont check for stack control
Fix: Add check_stack_overrun(). The last argument is NULL because it is not
used
- Adding data type aliases:
using Lex_column_charset_collation_attrs_st = Lex_charset_collation_st;
using Lex_column_charset_collation_attrs = Lex_charset_collation;
and using them all around the code (except lex_charset.*)
instead of the original names.
- Renaming Lex_field_type_st::lex_charset_collation()
to charset_collation_attrs()
- Renaming Column_definition::set_lex_charset_collation()
to set_charset_collation_attrs()
- Renaming Column_definition::lex_charset_collation()
to charset_collation_attrs()
Rationale:
The name "Lex_charset_collation" was a not very good name.
It does not tell details about its properties:
1. if the charset is optional (yes)
2. if the collation is optional (yes)
3. if the charset can be exact (yes) or context (no)
4. if the collation can be: exact (yes) or context (yes)
5. if the clauses can be repeated multiple times (yes)
We'll need a few new data types soon with different properties.
For example, to fix MDEV-27896 and MDEV-27782, we'll need a new
data type which is very like Lex_charset_collation, but additionally
supports CHARACTER SET DEFAULT (which is allowed on table and database level,
but is not allowed on the column level yet), i.e. with:
"the charset can be exact (yes) or context (yes)" in N3.
So we'll have to rename Lex_charset_collation to something else,
e.g.: Lex_exact_charset_extended_collation_attrs,
and add a new data type:
e.g. Lex_extended_charset_extended_collation_attrs
Also, we'll possibly allow CHARACTER SET DEFAULT at the column level for
consistency with other places. So the storge on the column level can change:
- from Lex_exact_charset_extended_collation_attrs
- to Lex_extended_charset_extended_collation_attrs
Adding the aliases introduces a convenient abstraction against
upcoming renames and c++ data type changes.
Range can be thought about in similar manner as wildcard (*) where
more than one elements are processed. To implement range notation, extended
json parser to parse the 'to' keyword and added JSON_PATH_ARRAY_RANGE for
path type. If there is 'to' keyword then use JSON_PATH_ARRAY range for
path type along with existing type.
This new integer to store the end index of range is n_item_end.
When there is 'to' keyword, store the integer in n_item_end else store in
n_item.
This patch can be viewed as combination of two parts:
1) Enabling '-' in the path so that the parser does not give out a warning.
2) Setting the negative index to a correct value and returning the
appropriate value.
1) To enable using the negative index in the path:
To make the parser not return warning when negative index is used in path
'-' needs to be allowed in json path characters. P_NEG is added
to enable this and is made recognizable by setting the 45th index of
json_path_chr_map[] to P_NEG (instead of previous P_ETC)
because 45 corresponds to '-' in unicode.
When the path is being parsed and '-' is encountered, the parser should
recognize it as parsing '-' sign, so a new json state PS_NEG is required.
When the state is PS_NEG, it means that a negative integer is
going to be parsed so set is_negative_index of current step to 1 and
n_item is set accordingly when integer is encountered after '-'.
Next proceed with parsing rest of the path and get the correct path.
Next thing is parsing the json and returning correct value.
2) Setting the negative index to a correct value and returning the value:
While parsing json if we encounter array and the path step for the array
is a negative index (n_item < 0), then we can count the number of elements
in the array and set n_item to correct corresponding value. This is done in
json_skip_array_and_count.
This patch also fixes:
MDEV-27690 Crash on `CHARACTER SET csname COLLATE DEFAULT` in column definition
MDEV-27853 Wrong data type on column `COLLATE DEFAULT` and table `COLLATE some_non_default_collation`
MDEV-28067 Multiple conflicting column COLLATE clauses are not rejected
MDEV-28118 Wrong collation of `CAST(.. AS CHAR COLLATE DEFAULT)`
MDEV-28119 Wrong column collation on MODIFY + CONVERT
in about a hundred of users of MY_BITMAP, only two were using its
built-in mutex, and only one of those two was actually needing it.
Remove the mutex from MY_BITMAP, remove all associated conditions
and checks in bitmap functions. Use an external LOCK_temp_pool
mutex and temp_pool_set_next/temp_pool_clear_bit acccessors.
Remove bitmap_init/bitmap_free, always use my_* versions.
This change removed 68 explict strlen() calls from the code.
The following renames was done to ensure we don't use the old names
when merging code from earlier releases, as using the new variables
for print function could result in crashes:
- charset->csname renamed to charset->cs_name
- charset->name renamed to charset->coll_name
Almost everything where mechanical changes except:
- Changed to use the new Protocol::store(LEX_CSTRING..) when possible
- Changed to use field->store(LEX_CSTRING*, CHARSET_INFO*) when possible
- Changed to use String->append(LEX_CSTRING&) when possible
Other things:
- There where compiler issues with ensuring that all character set names
points to the same string: gcc doesn't allow one to use integer constants
when defining global structures (constant char * pointers works fine).
To get around this, I declared defines for each character set name
length.
Changes:
- To detect automatic strlen() I removed the methods in String that
uses 'const char *' without a length:
- String::append(const char*)
- Binary_string(const char *str)
- String(const char *str, CHARSET_INFO *cs)
- append_for_single_quote(const char *)
All usage of append(const char*) is changed to either use
String::append(char), String::append(const char*, size_t length) or
String::append(LEX_CSTRING)
- Added STRING_WITH_LEN() around constant string arguments to
String::append()
- Added overflow argument to escape_string_for_mysql() and
escape_quotes_for_mysql() instead of returning (size_t) -1 on overflow.
This was needed as most usage of the above functions never tested the
result for -1 and would have given wrong results or crashes in case
of overflows.
- Added Item_func_or_sum::func_name_cstring(), which returns LEX_CSTRING.
Changed all Item_func::func_name()'s to func_name_cstring()'s.
The old Item_func_or_sum::func_name() is now an inline function that
returns func_name_cstring().str.
- Changed Item::mode_name() and Item::func_name_ext() to return
LEX_CSTRING.
- Changed for some functions the name argument from const char * to
to const LEX_CSTRING &:
- Item::Item_func_fix_attributes()
- Item::check_type_...()
- Type_std_attributes::agg_item_collations()
- Type_std_attributes::agg_item_set_converter()
- Type_std_attributes::agg_arg_charsets...()
- Type_handler_hybrid_field_type::aggregate_for_result()
- Type_handler_geometry::check_type_geom_or_binary()
- Type_handler::Item_func_or_sum_illegal_param()
- Predicant_to_list_comparator::add_value_skip_null()
- Predicant_to_list_comparator::add_value()
- cmp_item_row::prepare_comparators()
- cmp_item_row::aggregate_row_elements_for_comparison()
- Cursor_ref::print_func()
- Removes String_space() as it was only used in one cases and that
could be simplified to not use String_space(), thanks to the fixed
my_vsnprintf().
- Added some const LEX_CSTRING's for common strings:
- NULL_clex_str, DATA_clex_str, INDEX_clex_str.
- Changed primary_key_name to a LEX_CSTRING
- Renamed String::set_quick() to String::set_buffer_if_not_allocated() to
clarify what the function really does.
- Rename of protocol function:
bool store(const char *from, CHARSET_INFO *cs) to
bool store_string_or_null(const char *from, CHARSET_INFO *cs).
This was done to both clarify the difference between this 'store' function
and also to make it easier to find unoptimal usage of store() calls.
- Added Protocol::store(const LEX_CSTRING*, CHARSET_INFO*)
- Changed some 'const char*' arrays to instead be of type LEX_CSTRING.
- class Item_func_units now used LEX_CSTRING for name.
Other things:
- Fixed a bug in mysql.cc:construct_prompt() where a wrong escape character
in the prompt would cause some part of the prompt to be duplicated.
- Fixed a lot of instances where the length of the argument to
append is known or easily obtain but was not used.
- Removed some not needed 'virtual' definition for functions that was
inherited from the parent. I added override to these.
- Fixed Ordered_key::print() to preallocate needed buffer. Old code could
case memory overruns.
- Simplified some loops when adding char * to a String with delimiters.
Address review input: switch Name_resolution_context::ignored_tables from
table_map to a list of TABLE_LIST objects. The rationale is that table
bits may be changed due to query rewrites, etc, which may potentially
require updating ignored_tables.
The query used a subquery of this form:
SELECT ...
WHERE
EXISTS( SELECT ...
FROM JSON_TABLE(outer_ref, ..) as JT
WHERE trivial_correlation_cond)
EXISTS-to-IN conversion code was unable to see that the subquery will
still be correlated after the trivial_correlation is removed, which
eventually caused a crash due to inability to construct a query plan.
Fixed by making Item_subselect::walk() also walk arguments of Table
Functions.
(Also fixes MDEV-25254).
Re-work Name Resolution for the argument of JSON_TABLE(json_doc, ....)
function. The json_doc argument can refer to other tables, but it can
only refer to the tables that precede[*] the JSON_TABLE(...) call.
[*] - For queries with RIGHT JOINs, the "preceding" is determined after
the query is normalized by converting RIGHT JOIN into left one.
The implementation is as follows:
- Table function arguments use their own Name_resolution_context.
- The Name_resolution_context now has a bitmap of tables that should be
ignored when searching for a field.
- get_disallowed_table_deps() walks the TABLE_LIST::nested_join tree
and computes a bitmap of tables that do not "precede" the given
JSON_TABLE(...) invocation (according the above definition of
"preceding").