Use < TL_FIRST_WRITE for determining a READ transaction.
Use TL_FIRST_WRITE in relational comparisons, replacing TL_WRITE_ALLOW_WRITE
as the minimum WRITE lock type.
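The comparison can be sketched as follows. This is only an illustration: the enum is a shortened, hypothetical subset of the real lock type list, and the helper name is_read_lock() is an assumption made for this example.

```cpp
#include <cassert>

// Simplified, hypothetical subset of the lock type enum. The real enum has
// more members; only the relative ordering matters for this sketch.
enum lock_type_sketch {
  TL_UNLOCK = 0,
  TL_READ,
  TL_READ_NO_INSERT,
  TL_WRITE_ALLOW_WRITE,
  TL_WRITE,
  // Assumed alias for the lowest WRITE lock type.
  TL_FIRST_WRITE = TL_WRITE_ALLOW_WRITE
};

// A lock request is treated as belonging to a READ transaction when its
// type sorts below the first WRITE lock type.
inline bool is_read_lock(lock_type_sketch type) {
  assert(type >= TL_UNLOCK && type <= TL_WRITE);
  return type < TL_FIRST_WRITE;
}
```

Comparing against the alias rather than a concrete member means a new lowest WRITE type only requires changing the alias.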
Ever since commit 007f68c37f,
ALTER TABLE no longer invokes handler::open() after
handler::commit_inplace_alter_table().
ha_innobase::reload_statistics(): Reload or recompute statistics
after ALTER TABLE.
innodb_notify_tabledef_changed(): A new function to invoke
ha_innobase::reload_statistics().
handlerton::notify_tabledef_changed(): Add the parameter handler*
so that ha_innobase::reload_statistics() can be invoked.
ha_partition::notify_tabledef_changed(),
partition_notify_tabledef_changed(): Pass through the call
to any partitions or subpartitions.
This is based on code that was supplied by Monty.
The assertion failed in handler::ha_reset upon SELECT under
READ UNCOMMITTED from table with index on virtual column.
This was a debug-only failure, though the problem is much wider:
* MY_BITMAP is a structure containing my_bitmap_map, the latter is a raw
bitmap.
* read_set, write_set and vcol_set of TABLE are the pointers to MY_BITMAP
* The rest of MY_BITMAPs are stored in TABLE and TABLE_SHARE
* The pointers to the stored MY_BITMAPs, like orig_read_set etc, and
sometimes all_set and tmp_set, are assigned to the pointers.
* Sometimes tmp_use_all_columns is used to substitute the raw bitmap
directly with all_set.bitmap
* Sometimes even bitmaps are directly modified, like in
TABLE::update_virtual_field(): bitmap_clear_all(&tmp_set) is called.
The last three bullets in the list, when used together (which is almost
always), make the program flow cumbersome and impossible to follow,
not to mention the errors they cause, like this MDEV-17556, where tmp_set
pointer was assigned to read_set, write_set and vcol_set, then its bitmap
was substituted with all_set.bitmap by dbug_tmp_use_all_columns() call,
and then bitmap_clear_all(&tmp_set) was applied to all this.
To untangle this knot, the following rule should be applied:
* Never substitute bitmaps! This patch is about this.
orig_*, all_set bitmaps are never substituted already.
This patch changes the following function prototypes:
* tmp_use_all_columns, dbug_tmp_use_all_columns
to accept MY_BITMAP** and to return MY_BITMAP * instead of my_bitmap_map*
* tmp_restore_column_map, dbug_tmp_restore_column_maps to accept
MY_BITMAP* instead of my_bitmap_map*
These functions now will substitute read_set/write_set/vcol_set directly,
and won't touch underlying bitmaps.
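A minimal sketch of the pointer-substitution rule, assuming heavily simplified stand-ins for MY_BITMAP and TABLE. The type names BitmapSketch and TableSketch and both helper functions are hypothetical and only mirror the shape of the changed prototypes:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for MY_BITMAP: one machine word of column bits.
struct BitmapSketch { uint64_t bits; };

// Hypothetical stand-in for TABLE with one of the three set pointers.
struct TableSketch {
  BitmapSketch all_set{~0ULL};
  BitmapSketch tmp_set{0};
  BitmapSketch *read_set = &tmp_set;
};

// Redirect *set to all_set and return the previous pointer.
// No bitmap contents are modified.
inline BitmapSketch *use_all_columns(TableSketch &t, BitmapSketch **set) {
  assert(set && *set);
  BitmapSketch *old = *set;
  *set = &t.all_set;
  return old;
}

// Restore the pointer that use_all_columns() returned.
inline void restore_column_map(BitmapSketch **set, BitmapSketch *old) {
  assert(set && old);
  *set = old;
}
```

Because only the pointer changes, clearing tmp_set while read_set is redirected cannot affect all_set, which is the failure mode described above.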
We implement an idea that was suggested by Michael 'Monty' Widenius
in October 2017: When InnoDB is inserting into an empty table or partition,
we can write a single undo log record TRX_UNDO_EMPTY, which will cause
ROLLBACK to clear the table.
For this to work, the insert into an empty table or partition must be
covered by an exclusive table lock that will be held until the transaction
has been committed or rolled back, or the INSERT operation has been
rolled back (and the table is empty again), in lock_table_x_unlock().
Clustered index records that are covered by the TRX_UNDO_EMPTY record
will carry DB_TRX_ID=0 and DB_ROLL_PTR=1<<55, and thus they cannot
be distinguished from what MDEV-12288 leaves behind after purging the
history of row-logged operations.
Concurrent non-locking reads must be adjusted: If the read view was
created before the INSERT into an empty table, then we must continue
to imagine that the table is empty, and not try to read any records.
If the read view was created after the INSERT was committed, then
all records must be visible normally. To implement this, we introduce
the field dict_table_t::bulk_trx_id.
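The visibility rule can be sketched roughly as below. This is not the InnoDB code: ReadViewSketch is a reduced, hypothetical model of a read view, and must_see_empty() is a name invented for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using trx_id_sketch = uint64_t;

// Hypothetical, reduced read view: ids at or above low_limit_id started
// after the view was created; 'active' holds the sorted ids that were
// still uncommitted at that moment.
struct ReadViewSketch {
  trx_id_sketch low_limit_id;
  std::vector<trx_id_sketch> active;

  bool changes_visible(trx_id_sketch id) const {
    assert(std::is_sorted(active.begin(), active.end()));
    if (id >= low_limit_id) return false;
    return !std::binary_search(active.begin(), active.end(), id);
  }
};

// Hypothetical stand-in for the table object; 0 means that no insert into
// an empty table has been recorded.
struct BulkTableSketch { trx_id_sketch bulk_trx_id = 0; };

// True if a non-locking reader must keep treating the table as empty.
inline bool must_see_empty(const ReadViewSketch &view,
                           const BulkTableSketch &table) {
  return table.bulk_trx_id != 0 && !view.changes_visible(table.bulk_trx_id);
}
```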
This special handling only applies to the very first INSERT statement
of a transaction for the empty table or partition. If a subsequent
statement in the transaction is modifying the initially empty table again,
we must enable row-level undo logging, so that we will be able to
roll back to the start of the statement in case of an error (such as
duplicate key).
INSERT IGNORE will continue to use row-level logging and locking, because
implementing it would require the ability to roll back the latest row.
Since the undo log that we write only allows us to roll back the entire
statement, we cannot support INSERT IGNORE. We will introduce a
handler::extra() parameter HA_EXTRA_IGNORE_INSERT to indicate to storage
engines that INSERT IGNORE is being executed.
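The choice between table-level and row-level undo logging described above can be modelled as a small state machine. Everything in this sketch (the type names, the fields and the two functions) is a hypothetical simplification and not the actual trx_mod_table_time_t interface:

```cpp
#include <cassert>

enum class UndoSketch { TABLE_LEVEL, ROW_LEVEL };

// Hypothetical per-table state kept by a transaction.
struct TableUndoState {
  bool bulk_active = false;  // current statement is covered by one table-level record
  bool touched = false;      // the transaction has already modified this table
};

// Decide how the next inserted row is undo-logged.
inline UndoSketch on_insert(TableUndoState &s, bool table_empty,
                            bool insert_ignore) {
  assert(!(s.bulk_active && !s.touched));
  if (s.bulk_active) return UndoSketch::TABLE_LEVEL;  // same statement continues
  if (!s.touched && table_empty && !insert_ignore) {
    s.touched = true;
    s.bulk_active = true;  // first INSERT into an empty table
    return UndoSketch::TABLE_LEVEL;
  }
  s.touched = true;
  return UndoSketch::ROW_LEVEL;
}

// At the end of a statement, later statements fall back to row-level undo.
inline void on_statement_end(TableUndoState &s) { s.bulk_active = false; }
```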
In many test cases, we add an extra record to the table, so that during
the 'interesting' part of the test, row-level locking and logging will
be used.
Replicas will continue to use row-level logging and locking until
MDEV-24622 has been addressed. Likewise, this optimization will be
disabled in Galera cluster until MDEV-24623 enables it.
dict_table_t::bulk_trx_id: The latest active or committed transaction
that initiated an insert into an empty table or partition.
Protected by exclusive table lock and a clustered index leaf page latch.
ins_node_t::bulk_insert: Whether bulk insert was initiated.
trx_t::mod_tables: Use C++11 style accessors (emplace instead of insert).
Unlike earlier, this collection will also cover temporary tables.
trx_mod_table_time_t: Add start_bulk_insert(), end_bulk_insert(),
is_bulk_insert(), was_bulk_insert().
trx_undo_report_row_operation(): Before accessing any undo log pages,
invoke trx->mod_tables.emplace() in order to determine whether undo
logging was disabled, or whether this is the first INSERT and we are
supposed to write a TRX_UNDO_EMPTY record.
row_ins_clust_index_entry_low(): If we are inserting into an empty
clustered index leaf page, set the ins_node_t::bulk_insert flag for
the subsequent trx_undo_report_row_operation() call.
lock_rec_insert_check_and_lock(), lock_prdt_insert_check_and_lock():
Remove the redundant parameter 'flags' that can be checked in the caller.
btr_cur_ins_lock_and_undo(): Simplify the logic. Correctly write
DB_TRX_ID,DB_ROLL_PTR after invoking trx_undo_report_row_operation().
trx_mark_sql_stat_end(), ha_innobase::extra(HA_EXTRA_IGNORE_INSERT),
ha_innobase::external_lock(): Invoke trx_t::end_bulk_insert() so that
the next statement will not be covered by table-level undo logging.
ReadView::changes_visible(trx_id_t) const: New accessor for the case
where the trx_id_t is not read from a potentially corrupted index page
but directly from the memory. In this case, we can skip a sanity check.
row_sel(), row_sel_try_search_shortcut(), row_search_mvcc(),
row_sel_try_search_shortcut_for_mysql(),
row_merge_read_clustered_index(): Check dict_table_t::bulk_trx_id.
row_sel_clust_sees(): Replaces lock_clust_rec_cons_read_sees().
lock_sec_rec_cons_read_sees(): Replaced with lower-level code.
btr_root_page_init(): Refactored from btr_create().
dict_index_t::clear(), dict_table_t::clear(): Empty an index or table,
for the ROLLBACK of an INSERT operation.
ROW_T_EMPTY, ROW_OP_EMPTY: Note a concurrent ROLLBACK of an INSERT
into an empty table.
This is joint work with Thirunarayanan Balathandayuthapani,
who created a working prototype.
Thanks to Matthias Leich for extensive testing.
The idea of this fix is that it is enough to prevent
next_auto_inc_val from incrementing when an error occurs, to fix this problem
and also MDEV-17333.
So this patch basically reverts the existing fix for MDEV-17333.
Some functions on ha_partition call functions on all partitions, but handler->reset() is only called for the partitions pruned by m_partitions_to_reset. So Spider did not clear the pointer on unpruned partitions; if the unpruned partitions are used by the next query, Spider references a pointer that has already been freed.
This failure was caused because of several bugs:
- Someone had removed s3-slave-ignore-updates=1 from slave.cnf, which
caused the slave to remove files that the master was working on.
- Bug in ha_partition::change_partitions() that didn't reset m_new_file
in case of errors. This caused crashes in ha_maria::extra() as the
maria handler was called on files that were already closed.
- In ma_pagecache there was a bug that when one got a read error on a
big block (s3 block), it left the flag PCBLOCK_BIG_READ on for the page,
which caused an assert when the page was flushed.
- Flush all cached tables in case of ignored ALTER TABLE
Note that when merging code from 10.3, that fixes the partition bug, use
the code from this patch instead.
Changes to ma_pagecache.cc written or reviewed by Sanja
This also fixes some issues with
MDEV-23730 s3.replication_partition 'innodb,mix' segv
The problem was that mysql_change_partitions() closes all handler files
in case of error, which was not properly reflected in
fast_alter_partition_table(). This caused handle_alter_part_error() to
try to close already closed tables, which caused the crash.
Fixed fast_alter_partition_table() to reflect when tables are opened.
I also fixed that ha_partition::change_partitions() resets m_new_file in
case of errors.
Either of the above changes fixes the issue, but both are needed to ensure
that the code works as expected.
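The effect of resetting m_new_file on error can be illustrated with a small model. FileSketch and PartitionSketch are hypothetical types; only the member name m_new_file is taken from the description above, and its real type differs.

```cpp
#include <cassert>
#include <vector>

// Hypothetical handler file that must not be closed twice.
struct FileSketch {
  bool open = true;
  int close_calls = 0;
  void close() {
    assert(open);  // a second close would be the crash described above
    open = false;
    ++close_calls;
  }
};

struct PartitionSketch {
  std::vector<FileSketch *> m_new_file;  // simplified stand-in

  // Returns false on error; in that case all new files are closed and the
  // list is reset so that later cleanup does not touch them again.
  bool change_partitions(std::vector<FileSketch> &files, bool fail) {
    for (FileSketch &f : files) m_new_file.push_back(&f);
    if (fail) {
      for (FileSketch *f : m_new_file) f->close();
      m_new_file.clear();
      return false;
    }
    return true;
  }

  // Error-path cleanup: closes whatever is still listed.
  void cleanup() {
    for (FileSketch *f : m_new_file) f->close();
    m_new_file.clear();
  }
};
```

Without the clear() in the error branch, cleanup() would call close() on files that are already closed.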
first step in moving drop table out of the handler.
todo: other methods that don't need an open table
for now hton->drop_table is optional, for backward compatibility
reasons
- Some of the bug fixes are backports from 10.5!
- The fix in innobase/fil/fil0fil.cc is just a backport to get fewer
error messages in mysqld.1.err when running with valgrind.
- Renamed HAVE_valgrind_or_MSAN to HAVE_valgrind
Apply this patch from Percona Server (amended for 10.5):
commit cd7201514fee78aaf7d3eb2b28d2573c76f53b84
Author: Laurynas Biveinis <laurynas.biveinis@gmail.com>
Date: Tue Nov 14 06:34:19 2017 +0200
Fix bug 1704195 / 87065 / TDB-83 (Stop ANALYZE TABLE from flushing table definition cache)
Make ANALYZE TABLE stop flushing affected tables from the table
definition cache, which has the effect of not blocking any subsequent
new queries involving the table if there's a parallel long-running
query:
- new table flag HA_ONLINE_ANALYZE, return it for InnoDB and TokuDB
tables;
- in mysql_admin_table, if we are performing ANALYZE TABLE, and the
table flag is set, do not remove the table from the table
definition cache, do not invalidate query cache;
- in partitioning handler, refresh the query optimizer statistics
after ANALYZE if the underlying handler supports HA_ONLINE_ANALYZE;
- new testcases main.percona_nonflushing_analyze_debug,
parts.percona_nonflushing_analyze_debug and a supporting debug sync
point.
For TokuDB, this change exposes bug TDB-83 (Index cardinality stats
updated for handler::info(HA_STATUS_CONST), not often enough for
tokudb_cardinality_scale_percent). TokuDB may return different
rec_per_key values depending on dynamic variable
tokudb_cardinality_scale_percent value. The server does not have a way
of knowing that changing this variable invalidates the previous
rec_per_key values in any opened table shares, and so does not call
info(HA_STATUS_CONST) again. Fix by updating rec_per_key for both
HA_STATUS_CONST and HA_STATUS_VARIABLE. This also forces a re-record
of tokudb.bugs.db756_card_part_hash_1_pick, with the new output
seeming to be more correct.
reduce the amount of engine-specific code in the server,
particularly as it does not serve any purpose now.
may be needed for VP engine,
to be reconsidered in MDEV-7795
MDEV-22531 Remove maria::implicit_commit()
MDEV-22607 Assertion `ha_info->ht() != binlog_hton' failed in
MYSQL_BIN_LOG::unlog_xa_prepare
From the handler point of view, Aria now looks like a transactional
engine. One effect of this is that we don't need to call
maria::implicit_commit() anymore.
This change also forces the server to call trans_commit_stmt() after doing
any reads or writes to system tables. This work will also make it easier
to later allow users to have system tables in other engines than Aria.
To handle the case that Aria doesn't support rollback, a new
handlerton flag, HTON_NO_ROLLBACK, was added to engines that have
transactions without rollback (for the moment only binlog and Aria).
Other things
- Moved freeing of MARIA_SHARE to a separate function as the MARIA_SHARE
can be still part of a transaction even if the table has closed.
- Changed Aria checkpoint to use the new MARIA_SHARE free function. This
fixes a possible memory leak when using S3 tables
- Changed testing of binlog_hton to instead test for HTON_NO_ROLLBACK
- Removed checking of has_transaction_manager() in handler.cc as we can
assume that as the transaction was started by the engine, it does
support transactions.
- Added new class 'start_new_trans' that can be used to start independent
sub transactions, for example while reading mysql.proc, using help or
status tables etc.
- open_system_tables...() and open_proc_table_for_read() no longer
take an Open_tables_backup list. This is now handled by 'start_new_trans'.
- Split thd::has_transactions() to thd::has_transactions() and
thd::has_transactions_and_rollback()
- Added handlerton code to free cached transactions objects.
Needed by InnoDB.
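One way to picture the split between has_transactions() and has_transactions_and_rollback() is sketched below. The flag value, both struct types and the exact semantics are assumptions for illustration only; the server's real implementation is not shown in this message.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical flag value; the real HTON_NO_ROLLBACK constant may differ.
constexpr uint32_t HTON_NO_ROLLBACK_SKETCH = 1U << 0;

// Hypothetical stand-in for a handlerton.
struct HtonSketch { uint32_t flags; };

// Hypothetical list of engines registered in the current transaction.
struct TrxEnginesSketch {
  std::vector<const HtonSketch *> engines;

  // Any engine takes part in the transaction.
  bool has_transactions() const { return !engines.empty(); }

  // At least one participating engine is able to roll back.
  bool has_transactions_and_rollback() const {
    for (const HtonSketch *h : engines) {
      assert(h);
      if (!(h->flags & HTON_NO_ROLLBACK_SKETCH)) return true;
    }
    return false;
  }
};
```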
squash! 2ed35999f2a2d84f1c786a21ade5db716b6f1bbc
MDEV-22088 S3 partitioning support
All ALTER PARTITION commands should now work on S3 tables except
REBUILD PARTITION
TRUNCATE PARTITION
REORGANIZE PARTITION
In addition, PARTITIONED S3 TABLES can also be replicated.
This is achieved by storing the partition tables .frm and .par file on S3
for partitioned shared (S3) tables.
The discovery methods are enhanced by allowing engines that support
discovery to also support discovery of the partitioned tables' .frm and .par files
Things in more detail
- The .frm and .par files of partitioned tables are stored in S3 and kept
in sync.
- Added hton callback create_partitioning_metadata to inform handler
that metadata for a partitioned file has changed
- Added back handler::discover_check_version() to be able to check if
a table's or a part table's definition has changed.
- Added handler::check_if_updates_are_ignored(). Needed for partitioning.
- Renamed rebind() -> rebind_psi(), as it was before.
- Changed CHF_xxx handler flags to an enum
- Changed some checks from using table->file->ht to use
table->file->partition_ht() to get discovery to work with partitioning.
- If TABLE_SHARE::init_from_binary_frm_image() fails, ensure that we
don't leave any .frm or .par files around.
- Fixed that writefrm() doesn't leave unusable .frm files around
- Appended extension to path for writefrm() to be able to reuse the function
for creating .par files.
- Added DBUG_PUSH("") to a few functions that caused a lot of
non-critical tracing.
* rename to a generic name
* move remaining initializations from query exec to prepare time
* simplify/unify key handling in open_table_from_share and delayed
* remove dead code
* move tests where they belong
Sergei's commit ac6b3c4430 implemented handler status counters
compensation for underlying handlers like ha_partition.
`index_read_idx_map` is missing there, but it should have been fixed as
well (proof: ha_partition::index_read_idx_map never calls
ha_partition::index_read_map).
Note: all this compensation logic could be broken for subpartitions! (We
can experience double decrement)
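The compensation idea for a single wrapper level can be sketched like this. The types and the counter name are hypothetical; the sketch only shows why a wrapper that delegates to a counting child has to decrement once.

```cpp
#include <cassert>

// Hypothetical status counters shared by all handlers of a connection.
struct StatusSketch { long read_key = 0; };

// Underlying engine handler: counts every index read it performs.
struct LeafHandlerSketch {
  StatusSketch *st;
  void ha_index_read() {
    assert(st);
    ++st->read_key;
  }
};

// Wrapper handler (for example a partitioning layer) over one child.
struct WrapperHandlerSketch {
  StatusSketch *st;
  LeafHandlerSketch *child;
  void ha_index_read() {
    assert(st && child);
    ++st->read_key;          // counted once for the user-visible call
    child->ha_index_read();  // the child counts the same read again
    --st->read_key;          // compensate so the total stays at one
  }
};
```

A code path that delegates to the child without the matching decrement would count the read twice, which is the kind of omission described for index_read_idx_map.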