x86 builds don't use SIMD, and fast math and inlining cause
distances to be quite unstable, so that
1) comparison with the threshold no longer works: the distance calculated
twice between the same two vectors comes out differently
2) a bunch of identical vectors get a non-zero distance between
them and HNSW cross-links them with no outbound links (if there are
more than 2M identical vectors).
Let's strengthen the select_neighbors
heuristic to skip neighbors that are too close to each other.
MDEV-35418 suggests a better solution for this.
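The strengthened heuristic can be sketched roughly as below. This is an
illustration only, not the server code: Euclidean distance, the
function names and the min_gap tolerance are all assumptions made for
the example.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_neighbors(target, candidates, max_neighbors, min_gap=1e-6):
    """Pick up to max_neighbors candidates for target.

    min_gap is a hypothetical tolerance: a candidate that lies within
    min_gap of an already selected neighbor is treated as a duplicate
    and skipped, even if rounding noise makes its distance non-zero.
    """
    selected = []
    for cand in sorted(candidates, key=lambda c: dist(target, c)):
        if len(selected) >= max_neighbors:
            break
        d_target = dist(target, cand)
        keep = True
        for s in selected:
            d_sel = dist(cand, s)
            # Classic heuristic: drop cand if it is closer to an already
            # selected neighbor than to the target.
            # Strengthened check: also drop it if it is too close to s.
            if d_sel <= min_gap or d_sel < d_target:
                keep = False
                break
        if keep:
            selected.append(cand)
    return selected
```

With nearly identical vectors the classic check alone keeps all of them,
because the noisy distances between candidates are not smaller than the
noisy distances to the target; the extra tolerance keeps only one.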
With MDEV-34915 adjusting the mtr output of session
variables to be in order, the original variable omission for
x86_32 (added by MDEV-31609 - e0b6db2) is no longer required.
MYSQL_TMP_DIR is not necessarily under MYSQLTEST_VARDIR (it
definitely is not with --parallel), so LOAD DATA INFILE cannot use
MYSQL_TMP_DIR, because secure_file_priv=MYSQLTEST_VARDIR.
The exception is LOAD DATA LOCAL INFILE, which reads the file through
the client, but only in non-embedded builds.
followup for 7aa28a2a54
When an empty password is set, the server doesn't call
st_mysql_auth::hash_password and leaves MYSQL_SERVER_AUTH_INFO::auth_string
empty.
Fix:
generate hashes by calling hash_password for empty passwords as well. This
changes the API behavior slightly, but since even old plugins support it,
we can ignore this.
Some empty passwords could already be stored with no salt, though. The user
will have to call SET PASSWORD once again; authentication would not have
worked for such a password anyway.
Now, with streaming (MDEV-35032), we can no longer free MHNSW_Trx
at the end of the search. We cannot even free it at the end of
mhnsw_insert, because there can be a search running (INSERT ... SELECT).
Let's do reference counting, even though it's a thread-local object.
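A minimal sketch of the idea, with hypothetical names (Trx, acquire,
release) that are not taken from the server source: the per-thread
object is shared by an insert and a search running in the same thread,
and is freed only when the last user releases it.

```python
import threading

class Trx:
    """Hypothetical stand-in for a per-thread transaction context."""
    def __init__(self):
        self.refs = 0
        self.freed = False

    def free(self):
        self.freed = True

_local = threading.local()

def acquire():
    """Return this thread's Trx, creating it on first use."""
    trx = getattr(_local, "trx", None)
    if trx is None:
        trx = _local.trx = Trx()
    trx.refs += 1
    return trx

def release(trx):
    """Drop one reference; free the object when nobody uses it."""
    trx.refs -= 1
    if trx.refs == 0:
        trx.free()
        _local.trx = None
```

No locking is needed for the counter in this sketch because the object
is never shared between threads.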
Considering that users don't interact with MariaDB vector search directly,
but primarily use AI frameworks, we should use names familiar
to vector store connector writers and AI framework users.
That is, the industry standard M and ef.
mhnsw_cache_size -> mhnsw_max_cache_size
mhnsw_distance_function -> mhnsw_default_distance
mhnsw_max_edges_per_node -> mhnsw_default_m
mhnsw_min_limit -> mhnsw_ef_search
inside CREATE TABLE:
max_edges_per_node -> m
distance_function -> distance
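For connector or configuration code that still uses the old names, the
renames above can be captured in a lookup table. The helper below is a
hypothetical convenience for illustration, not something shipped with
the server.

```python
# Old system variable name -> new name, as listed above.
RENAMED_VARIABLES = {
    "mhnsw_cache_size": "mhnsw_max_cache_size",
    "mhnsw_distance_function": "mhnsw_default_distance",
    "mhnsw_max_edges_per_node": "mhnsw_default_m",
    "mhnsw_min_limit": "mhnsw_ef_search",
}

# Old CREATE TABLE index option name -> new name.
RENAMED_INDEX_OPTIONS = {
    "max_edges_per_node": "m",
    "distance_function": "distance",
}

def new_name(old, table=RENAMED_VARIABLES):
    """Translate an old name; names that were not renamed pass through."""
    return table.get(old, old)
```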
ALTER TABLE needs to open hlindex tables early enough, right after they
were created, so that cleanup after an error would see and delete them.
But they need to be external_lock-ed only in copy_data_between_tables,
after mysql_trans_prepare_alter_copy_data().
Let's move locking out of hlindex_open() into hlindex_lock()
Similarly to "ALTER TABLE fixes for high-level indexes", don't enable bulk
insert when issuing create ... insert into a table containing a vector
index. InnoDB can't handle the situation when bulk insert is enabled for
one table but disabled for another. We can't do bulk insert on a vector
index as it currently does table updates.
* add Aria truncate checks
* do store_lock() with a correct TL_xxx level
* remove InnoDB workaround for missing store_lock (from MDEV-35032)
* don't start transaction in temp tables (for Aria, with a test case)
Since high-level index tables do not participate in thr_multi_lock(), add
an explicit call to THR_LOCK::start_trans(). This is needed mostly for Aria
to handle transaction logging.
fix Field_vector::get_copy_func() for the case when length_bytes differ
fix do_copy_vec() to not guess length_bytes but take it from the field
(for keys length_bytes is always 2 for any length)
MDEV-35337 Server crash or assertion failure in join_read_first upon using vector distance in group by
allow Item_func_distance to be not only in tab->join->order,
but alternatively in tab->join->group_list
With streaming implemented, mhnsw no longer needs to know
the LIMIT in advance. Let's just cap it to avoid allocating
too much memory for the one-step result set.
init_from_binary_frm_image() wrongly assumed that
* if a table has primary key
* and it has the HA_PRIMARY_KEY_IN_READ_INDEX flag
* then ORDER BY any index automatically implies ORDER BY pk at the end,
that is, for an index (a,b,c), ORDER BY a,b,c means ORDER BY a,b,c,pk.
This is wrong: it holds not for _any index_ but only for indexes
that can be used for ORDER BY.
So, don't do `field->part_of_sortkey= share->keys_in_use`
but introduce `sort_keys_in_use` and use that.
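The difference between the two key maps can be illustrated as below.
The Key class and its sortable flag are assumptions made for this
sketch; the server's real bitmap types and the way it decides whether an
index supports ORDER BY are not shown here.

```python
from dataclasses import dataclass

@dataclass
class Key:
    name: str
    sortable: bool  # hypothetical flag: can the index return rows in key order?

def key_maps(keys):
    """Return (keys_in_use, sort_keys_in_use) as integer bitmaps."""
    keys_in_use = 0
    sort_keys_in_use = 0
    for i, key in enumerate(keys):
        keys_in_use |= 1 << i
        # Only indexes that can serve ORDER BY may mark a field
        # as part of a sort key.
        if key.sortable:
            sort_keys_in_use |= 1 << i
    return keys_in_use, sort_keys_in_use
```

A field's part_of_sortkey would then be derived from the second bitmap,
so an index that cannot be used for ORDER BY never contributes to it.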
Switch to a more predictable, shorter, and more correct output,
that is, print as many significant digits as necessary,
but not more (they'd be just zeros) and not fewer (it'd lose precision).
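One way to get "as many digits as necessary" is to try increasing
precision until the text converts back to the same value. The sketch
below assumes 32-bit floats and uses hypothetical helper names; it
illustrates the principle and is not the server's formatting code.

```python
import struct

def to_f32(x):
    """Round a Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]

def shortest_f32(x):
    """Shortest %g text that converts back to the same 32-bit float."""
    x = to_f32(x)
    text = ""
    # 9 significant digits always identify a 32-bit float uniquely.
    for digits in range(1, 10):
        text = "%.*g" % (digits, x)
        if to_f32(float(text)) == x:
            return text
    return text
```

For example, shortest_f32(0.1) gives "0.1" rather than a longer string
of digits that adds nothing once the value is read back as a 32-bit
float.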