* preserve the graph in memory between statements
* keep it in a TABLE_SHARE, available for concurrent searches
* nodes are generally read-only, walking the graph doesn't change them
* distance to target is cached, calculated only once
* SIMD-optimized bloom filter detects visited nodes
* nodes are stored in an array, not List, to better utilize bloom filter
* auto-adjusting heuristic to estimate the number of visited nodes
(to configure the bloom filter)
* many threads can concurrently walk the graph. MEM_ROOT and Hash_set
are protected with a mutex, but walking doesn't need them
* up to 8 threads can concurrently load nodes into the cache,
nodes are partitioned into 8 mutexes (8 is chosen arbitrarily, might
need tuning)
* concurrent editing is not supported though
* this is fine for MyISAM, TL_WRITE protects the TABLE_SHARE and the
graph (note that TL_WRITE_CONCURRENT_INSERT is not allowed, because an
INSERT into the main table means multiple UPDATEs in the graph)
* InnoDB uses secondary transaction-level caches linked in a list in
in thd->ha_data via a fake handlerton
* on rollback the secondary cache is discarded, on commit nodes
from the secondary cache are invalidated in the shared cache
while it is exclusively locked
* on savepoint rollback both caches are flushed. this can be improved
in the future with a row visibility callback
* graph size is controlled by @@mhnsw_cache_size, the cache is flushed
when it reaches the threshold
1. introduce alpha. the value of 1.1 is optimal, so hard-code it.
2. hard-code ef_construction=10, best by test
3. rename hnsw_max_connection_per_layer to mhnsw_max_edges_per_node
(max_connection is rather ambiguous in MariaDB) and add a help text
4. rename hnsw_ef_search to mhnsw_min_limit and add a help text
* sysvars should be REQUIRED_ARG
* fix a mix of US and UK spelling (use US)
* use consistent naming
* work if VEC_DISTANCE arguments are in the swapped order (const, col)
* work if VEC_DISTANCE argument is NULL/invalid or wrong length
* abort INSERT if the value is invalid or wrong length
* store the "number of neighbors" in a blob in endianness-independent way
* use field->store(longlong, bool) not field->store(double)
* a lot more error checking everywhere
* cleanup after errors
* simplify calling conventions, remove reinterpret_cast's
* todo/XXX comments
* whitespaces
* use float consistently
memory management is still totally PoC quality
This commit includes the work done in collaboration with Hugo Wen from
Amazon:
MDEV-33408 Alter HNSW graph storage and fix memory leak
This commit changes the way HNSW graph information is stored in the
second table. Instead of storing connections as separate records, it now
stores neighbors for each node, leading to significant performance
improvements and storage savings.
Comparing with the previous approach, the insert speed is 5 times faster,
search speed improves by 23%, and storage usage is reduced by 73%, based
on ann-benchmark tests with random-xs-20-euclidean and
random-s-100-euclidean datasets.
Additionally, in previous code, vector objects were not released after
use, resulting in excessive memory consumption (over 20GB for building
the index with 90,000 records), preventing tests with large datasets.
Now ensure that vectors are released appropriately during the insert and
search functions. Note there are still some vectors that need to be
cleaned up after search query completion. Needs to be addressed in a
future commit.
All new code of the whole pull request, including one or several files
that are either new files or modified ones, are contributed under the
BSD-new license. I am contributing on behalf of my employer Amazon Web
Services, Inc.
As well as the commit:
Introduce session variables to manage HNSW index parameters
Three variables:
hnsw_max_connection_per_layer
hnsw_ef_constructor
hnsw_ef_search
ann-benchmark tool is also updated to support these variables in commit
https://github.com/HugoWenTD/ann-benchmarks/commit/e09784e for branch
https://github.com/HugoWenTD/ann-benchmarks/tree/mariadb-configurable
All new code of the whole pull request, including one or several files
that are either new files or modified ones, are contributed under the
BSD-new license. I am contributing on behalf of my employer Amazon Web
Services, Inc.
Co-authored-by: Hugo Wen <wenhug@amazon.com>
the information about index algorithm was stored in two
places inconsistently split between both.
BTREE index could have key->algorithm == HA_KEY_ALG_BTREE, if the user
explicitly specified USING BTREE or HA_KEY_ALG_UNDEF, if not.
RTREE index had key->algorithm == HA_KEY_ALG_RTREE
and always had key->flags & HA_SPATIAL
FULLTEXT index had key->algorithm == HA_KEY_ALG_FULLTEXT
and always had key->flags & HA_FULLTEXT
HASH index had key->algorithm == HA_KEY_ALG_HASH or HA_KEY_ALG_UNDEF
long unique index always had key->algorithm == HA_KEY_ALG_LONG_HASH
In this commit:
All indexes except BTREE and HASH always have key->algorithm
set, HA_SPATIAL and HA_FULLTEXT flags are not used anymore (except
for storage to keep frms backward compatible).
As a side effect ALTER TABLE now detects FULLTEXT index renames correctly
When strict mode is enabled, all warnings during `INSERT` are
converted to errors regardless of their actual severity.
`WARN_SORTING_ON_TRUNCATED_LENGTH` is not considered severe enough
to be elevated to the ERROR level, and this commit fixes that
This task is inspired by the Percona implementation of
slow_query_log_always_write_time.
This task implements the variable log_slow_always_query_time (name
matching other MariaDB variables using the slow query log). The
default value for the variable is 31536000, which makes MariaDB
compatible with older installations.
For queries with execution time longer than log_slow_always_query_time
the variables log_slow_rate_limit and log_slow_min_examined_row_limit
will be ignored and the query will be written to the slow query log
if there is no other limitations (like log_slow_filter etc).
Other things:
- long_query_time internal variable renamed to log_slow_query_time.
- More descriptive information for "log_slow_query_time".
MDEV-27277 added warnings on truncation during sorting for SELECTs
but did not for DML operations. However, UPDATEs and DELETEs may also
perform sorting and thus produce warnings. This commit fixes that
This patch was suggested by Sergei Golubchik.
It reverts the second patch from the PR:
commit fa5eeb4931
Fixed ALTER TABLE NOCOPY keyword failure
and adds NOCOPY_SYM into keyword_func_sp_var_and_label.
The price is one extra shift/recuce conflict in yy_oracle.yy.
This should to tolerable.
This is a workaround patch to make buildbot green.
Renaming databases from db1/DB2 to m33020_db1/m33020_DB1
to make them unique. So the garbage left by other tests
does not show up any more.
The real problem will be fixed under terms of:
MDEV-35282 Performance schema does not clear package routines
to explicit row_start/row_end columns
In case of adding both system fields of same type (length, unsigned
flag) as old implicit system fields do the rename of implicit system
fields to the ones specified in ALTER, remove SYSTEM_INVISIBLE flag in
that case. Correct PERIOD clause must be specified in ALTER as well.
MDEV-34904 Inplace alter for implicit to explicit versioning is broken
Whether ALTER goes inplace and how it goes inplace depends on
handler_flags which goes from alter_info->flags by this logic:
ha_alter_info->handler_flags|= (alter_info->flags & ~flags_to_remove);
ALTER_VERS_EXPLICIT was not in flags_to_remove and its value (1ULL <<
35) clashed with ALTER_ADD_NON_UNIQUE_NON_PRIM_INDEX.
ALTER_VERS_EXPLICIT must not affect inplace, it is SQL-only so we
remove it from handler_flags.
* remove duplicate test file
* move all uuidv7 tests into plugin/type_uuid/mysql-test/type_uuid/
* remove mysys/ changes
* auto my_random_bytes() fallback - removes duplicate code from uuid,
and fixes all other users of my_random_bytes() that don't check
the return value (because, perhaps, they don't need crypto-strong
random bytes)
* End of 11.6 -> 11.7 in tests
* clarify the warning text
* UUID_VERSION_MASK()/UUID_VARIANT_MASK() must not depend on the version
* allow 4x more monotonic uuidv7 per millisecond - instead of stretching
1000 microseconds over 12 bits, let's use extra 2 bits as a counter
* rename for compatibility with Percona Server (uuid_v4, uuid_v7)
During a query execution some sorting and grouping operations
on strings may be involved. System variable max_sort_length defines
the maximum number of bytes to use when comparing strings during
sorting/grouping. Thus, the comparable parts of strings may be less
than their actual size, so the results of the query may be not
sorted/grouped properly.
To indicate that some comparisons were done on a truncated lengths,
a new warning has been introduced with this commit.
Adding support for the ROW data type in the stored function RETURNS clause:
- explicit ROW(..members...) for both sql_mode=DEFAULT and sql_mode=ORACLE
CREATE FUNCTION f1() RETURNS ROW(a INT, b VARCHAR(32)) ...
- anchored "ROW TYPE OF [db1.]table1" declarations for sql_mode=DEFAULT
CREATE FUNCTION f1() RETURNS ROW TYPE OF test.t1 ...
- anchored "[db1.]table1%ROWTYPE" declarations for sql_mode=ORACLE
CREATE FUNCTION f1() RETURN test.t1%ROWTYPE ...
Adding support for anchored scalar data types in RETURNS clause:
- "TYPE OF [db1.]table1.column1" for sql_mode=DEFAULT
CREATE FUNCTION f1() RETURNS TYPE OF test.t1.column1;
- "[db1.]table1.column1" for sql_mode=ORACLE
CREATE FUNCTION f1() RETURN test.t1.column1%TYPE;
Details:
- Adding a new sql_mode_t parameter to
sp_head::create()
sp_head::sp_head()
sp_package::create()
sp_package::sp_package()
to guarantee early initialization of sp_head::m_sql_mode.
Before this change, this member was not initialized at all during
CREATE FUNCTION/PROCEDURE/PACKAGE statements, and was not used.
Now it needs to be initialized to write properly the
mysql.proc.returns column, according to the create time sql_mode.
- Code refactoring to make the things simpler and functions smaller:
* Adding a new method
Field_row::row_create_fields(THD *thd, List<Spvar_definition> *list)
to make a Virtual_tmp_table with Fields for ROW members
from an explicit definition.
* Adding a new method
Field_row::row_create_fields(THD *thd, const Spvar_definition &def)
to make a Virtual_tmp_table with Fields for ROW members
from an explicit or a table anchored definition.
* Adding a new method
Item_args::add_array_of_item_field(THD *thd, const Virtual_tmp_table &vtable)
to create and array of Item_field corresponding to all Field instances
in a Virtual_tmp_table
* Removing Item_field_row::row_create_items(). It was decomposed
into the new methods described above.
* Moving the code from the loop body in sp_rcontext::init_var_items()
into a separate method Spvar_definition::make_item_field_row(),
to make the code clearer (smaller functions).
make_item_field_row() itself uses the new methods described above.
- Changing the data type of sp_head::m_return_field_def
from Column_definition to Spvar_definition.
So now it supports not only SQL column field types,
but also explicit ROW and anchored ROW data types,
as well as anchored column types.
- Adding a new Column_definition parameter to sp_head::create_result_field().
Before this patch, create_result_field() took the definition only
from m_return_field_def. Now it's also called with a local Column_definition
variable which contains the explicit definition resolved from an
anchored defition.
- Modifying sql_yacc.yy to support the new grammar.
Adding new helper methods:
* sf_return_fill_definition_row()
* sf_return_fill_definition_rowtype_of()
* sf_return_fill_definition_type_of()
- Fixing tests in:
* Virtual_tmp_table::setup_field_pointers() in sql_select.cc
* Send_field::normalize() in field.h
* store_column_type()
to prevent calling Type_handler_row::field_type(),
which is implemented a DBUG_ASSERT(0).
Before this patch the affected methods and functions were called only
for scalar data types. Now ROW is also possible.
- Adding a new virtual method Field::cols()
- Overriding methods:
Item_func_sp::cols()
Item_func_sp::element_index()
Item_func_sp::check_cols()
Item_func_sp::bring_value()
to support the ROW data type.
- Extending the rule sp_return_type to support
* explicit ROW and anchored ROW data types
* anchored scalar data types
- Overriding Field_row::sql_type() to print
the data type of an explicit ROW.
Changing the return type of the following functions:
- CURRENT_TIMESTAMP, CURRENT_TIMESTAMP(), NOW()
- SYSDATE()
- FROM_UNIXTIME()
from DATETIME to TIMESTAMP.
Note, the old function NOW() returning DATETIME is still available
as LOCALTIMESTAMP or LOCALTIMESTAMP(), e.g.:
SELECT
LOCALTIMESTAMP, -- DATETIME
CURRENT_TIMESTAMP; -- TIMESTAMP
The change in the functions return data type fixes some problems
that occurred near a DST change:
- Problem #1
INSERT INTO t1 (timestamp_field) VALUES (CURRENT_TIMESTAMP);
INSERT INTO t1 (timestamp_field) VALUES (COALESCE(CURRENT_TIMESTAMP));
could result into two different values inserted.
- Problem #2
INSERT INTO t1 (timestamp_field) VALUES (FROM_UNIXTIME(1288477526));
INSERT INTO t1 (timestamp_field) VALUES (FROM_UNIXTIME(1288477526+3600));
could result into two equal TIMESTAMP values near a DST change.
Additional changes:
- FROM_UNIXTIME(0) now returns SQL NULL instead of '1970-01-01 00:00:00'
(assuming time_zone='+00:00')
- UNIX_TIMESTAMP('1970-01-01 00:00:00') now returns SQL NULL instead of 0
(assuming time_zone='+00:00'
These additional changes are needed for consistency with TIMESTAMP fields,
which cannot store '1970-01-01 00:00:00 +00:00'
* fix paths to work when installed and not only from the source dir
* don't use a cnf file (no need to restart the server for this)
* set MYSQL_HOST to a valid hostname when testing an invalid MARIADB_HOST
* use invalid ip to have clients fail quickly and not waste time
on resolving the invalid hostname
followup for eedbb901e5
for large transaction
Description
===========
When a transaction commits, it copies the binlog events from
binlog cache to binlog file. Very large transactions
(eg. gigabytes) can stall other transactions for a long time
because the data is copied while holding LOCK_log, which blocks
other commits from binlogging.
The solution in this patch is to rename the binlog cache file to
a binlog file instead of copy, if the commiting transaction has
large binlog cache. Rename is a very fast operation, it doesn't
block other transactions a long time.
Design
======
* binlog_large_commit_threshold
type: ulonglong
scope: global
dynamic: yes
default: 128MB
Only the binlog cache temporary files large than 128MB are
renamed to binlog file.
* #binlog_cache_files directory
To support rename, all binlog cache temporary files are managed
as normal files now. `#binlog_cache_files` directory is in the same
directory with binlog files. It is created at server startup if it doesn't
exist. Otherwise, all files in the directory is deleted at startup.
The temporary files are named with ML_ prefix and the memorary address
of the binlog_cache_data object which guarantees it is unique.
* Reserve space
To supprot rename feature, It must reserve enough space at the
begin of the binlog cache file. The space is required for
Format description, Gtid list, checkpoint and Gtid events when
renaming it to a binlog file.
Since binlog_cache_data's cache_log is directly accessed by binlog log,
online alter and wsrep. It is not easy to update all the code. Thus
binlog cache will not reserve space if it is not session binlog cache or
wsrep session is enabled.
- m_file_reserved_bytes
Stores the bytes reserved at the begin of the cache file.
It is initialized in write_prepare() and cleared by reset().
The reserved file header is hide to callers. Thus there is no
change for callers. E.g.
- get_byte_position() still get the length of binlog data
written to the cache, but not the file length.
- truncate(0) will truncate the file to m_file_reserved_bytes but not 0.
- write_prepare()
write_prepare() is called everytime when anything is being written
into the cache. It will call init_file_reserved_bytes() to create
the cache file (if it doesn't exist) and reserve suitable space if
the data written exceeds buffer's size.
* Binlog_commit_by_rotate
It is used to encapsulate the code for remaing a binlog cache
tempoary file to binlog file.
- should_commit_by_rotate()
it is called by write_transaction_to_binlog_events() to check if
a binlog cache should be rename to a binlog file.
- commit()
That is the entry to rename a binlog cache and commit the
transaction. Both rename and commit are protected by LOCK_log,
Thus not other transactions can write anything into the renamed
binlog before it.
Rename happens in a rotation. After the new binlog file is generated,
replace_binlog_file() is called to:
- copy data from the new binlog file to its binlog cache file.
- write gtid event.
- rename the binlog cache file to binlog file.
After that the rotation will continue to succeed. Then the transaction
is committed in a seperated group itself. Its cache file will be
detached and cache log will be reset before calling
trx_group_commit_with_engines(). Thus only Xid event be written.
One change is that if the port is not supplied or out of bound, the
old behaviour is to print 3306. The new behaviour is to not print
it (if not supplied) or the out of bound value.
The existing syntax for CREATE SERVER
CREATE [OR REPLACE] SERVER [IF NOT EXISTS] server_name
FOREIGN DATA WRAPPER wrapper_name
OPTIONS (option [, option] ...)
option:
{ HOST character-literal
| DATABASE character-literal
| USER character-literal
| PASSWORD character-literal
| SOCKET character-literal
| OWNER character-literal
| PORT numeric-literal }
With this change we have:
option:
{ HOST character-literal
| DATABASE character-literal
| USER character-literal
| PASSWORD character-literal
| SOCKET character-literal
| OWNER character-literal
| PORT numeric-literal
| PORT quoted-numerical-literal
| identifier character-literal}
We store these options as a JSON field in the mysql.servers system
table. We retain the restriction that PORT needs to be a number, but
also allow it to be a quoted number, so that SHOW CREATE SERVER can be
used for dumping. Without an accompanied implementation of SHOW CREATE
SERVER, some mysqldump tests will fail. Therefore this commit should
be immediately followed by the one implementating SHOW CREATE SERVER,
with testing covering both.
The limit of socket length on unix according to libc is 108, see
sockaddr_un::sun_path, but in the table it is a string of max length
64, which results in truncation of socket and failure to connect by
plugins using servers such as spider.
Update `SESSION_USER()` behaviour to be comparable with `CURRENT_USER()`.
`SESSION_USER()` will return the user and host columns from `mysql.user`
used to authenticate the user when the session was created.
Historically `SESSION_USER()` was an alias of `USER()` function. The
main difference with `USER()` behaviour after this changes is that
`SESSION_USER()` now returns the host column from `mysql.user` instead of
the client host or ip.
NOTE: `SESSION_USER_IS_USER` old mode is added to make the change
backward compatible.
All new code of the whole pull request, including one or several files
that are either new files or modified ones, are contributed under the
BSD-new license. I am contributing on behalf of my employer
Amazon Web Services, Inc.
Only `mysql` client program was using $MYSQL_HOST as the default host.
Add the same feature in most other client programs but using
$MARIADB_HOST instead.
All new code of the whole pull request, including one or several files that are
either new files or modified ones, are contributed under the BSD-new license. I
am contributing on behalf of my employer Amazon Web Services, Inc.
When calculate_cond_selectivity_for_table() takes into account multi-
column selectivities from range access, it tries to take-into account
that selectivity for some columns may have been already taken into account.
For example, for range access on IDX1 using {kp1, kp2}, the selectivity
of restrictions on "kp2" might have already been taken into account
to some extent.
So, the code tries to "discount" that using rec_per_key[] estimates.
This seems to be wrong and unreliable: the "discounting" may produce a
rselectivity_multiplier number that hints that the overall selectivity
of range access on IDX1 was greater than 1.
Do a conservative fix: if we arrive at conclusion that selectivity of
range access on condition in IDX1 >1.0, clip it down to 1.