Rather than adding a small extra amount to the size of chunks, keep them
at the specified size. The rest of the chunk initialization code
adapts to this small size reduction. This change is made in the general
case, not just for large pages, to keep it simple.
The chunk size is controlled by innodb-buffer-pool-chunk-size. Increasing
it internally by the length of a descriptor table makes things
difficult with large pages. With innodb-buffer-pool-chunk-size set to 2M,
the code before this commit would have added a small extra amount to this
value when allocating it. While not normally a problem, with large pages
it requires additional space: a whole extra large page. With a number of
pools, or with 1G or 16G large pages, this is quite significant.
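Illustrative arithmetic (2M large pages, 2M chunks, 256M buffer pool;
the descriptor overhead is small, but any overhead forces rounding up
to a whole extra large page):

    before: 2M chunk + overhead -> rounds up to 2 large pages = 4M/chunk
            128 chunks * 4M = 512M physically allocated
    after:  128 chunks * 2M = 256M, exactly as requested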
By removing this additional amount, DBAs can set
innodb-buffer-pool-chunk-size to the large page size, or a multiple of
it, and actually get that amount allocated. Previously they had to fudge
a smaller value.
The innodb.test results show how this value was fudged across a number of
tests. With this change the values are just between 488 and 500, depending
on architecture and build options.
Tested with --large-pages --innodb-buffer-pool-size=256M
--innodb-buffer-pool-chunk-size=2M on x86_64 with a 2M default large page
size. Breaking before buf_pool initialization, one large page had been
allocated by MyISAM; by the end of the function, 128 huge pages were
allocated as expected. A further 16 pages were allocated for a 32M log
buffer, and during startup 1 page was briefly allocated to the redo log.
This is a follow-up task to MDEV-12026, which introduced
innodb_checksum_algorithm=full_crc32 and a simpler page format.
MDEV-12026 did not enable full_crc32 for page_compressed tables,
which we will be doing now.
This is joint work with Thirunarayanan Balathandayuthapani.
For innodb_checksum_algorithm=full_crc32 we change the
page_compressed format as follows:
FIL_PAGE_TYPE: The most significant bit will be set to indicate
page_compressed format. The least significant bits will contain
the compressed page size, rounded up to a multiple of 256 bytes.
The checksum will be stored in the last 4 bytes of the page
(whether it is the full page or a page_compressed page whose
size is determined by FIL_PAGE_TYPE), covering all preceding
bytes of the page. If encryption is used, then the page will
be encrypted between compression and computing the checksum.
For page_compressed, FIL_PAGE_LSN will not be repeated at
the end of the page.
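A minimal sketch of the FIL_PAGE_TYPE encoding described above
(illustrative only; the function names are hypothetical, not the
actual InnoDB declarations):

    #include <cstdint>

    /* Bit 15 of the 16-bit FIL_PAGE_TYPE field marks a page_compressed
    page; the low bits hold the compressed length in units of 256 bytes,
    rounded up. */
    static uint16_t encode_page_type(uint32_t compressed_len)
    {
        uint32_t rounded = (compressed_len + 255) / 256;
        return uint16_t((1U << 15) | rounded);
    }

    static uint32_t decode_compressed_size(uint16_t page_type)
    {
        return uint32_t(page_type & 0x7fff) * 256;
    }

The checksum is then computed over all bytes that precede the last
4 bytes of the page whose size this field determines.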
FSP_SPACE_FLAGS (already implemented as part of MDEV-12026):
We will store the innodb_compression_algorithm that may be used
to compress pages. Previously, the choice of algorithm was written
to each compressed data page separately, and one would be unable
to know in advance which compression algorithm(s) are used.
fil_space_t::full_crc32_page_compressed_len(): Determine if the
page_compressed algorithm of the tablespace needs to know the
exact length of the compressed data. If yes, we will reserve and
write an extra byte for this right before the checksum.
buf_page_is_compressed(): Determine if a page uses page_compressed
(in any innodb_checksum_algorithm).
fil_page_decompress(): Pass also fil_space_t::flags so that the
format can be determined.
buf_page_is_zeroes(): Check if a page is full of zero bytes.
buf_page_full_crc32_is_corrupted(): Renamed from
buf_encrypted_full_crc32_page_is_corrupted(). For full_crc32,
we always simply validate the checksum against the page contents,
while the physical page size is explicitly specified by an
unencrypted part of the page header.
buf_page_full_crc32_size(): Determine the size of a full_crc32 page.
buf_dblwr_check_page_lsn(): Make this a debug-only function, because
it involves potentially costly lookups of fil_space_t.
create_table_info_t::check_table_options(),
ha_innobase::check_if_supported_inplace_alter(): Do allow the creation
of SPATIAL INDEX with full_crc32 also when page_compressed is used.
commit_cache_norebuild(): Preserve the compression algorithm when
updating the page_compression_level.
dict_tf_to_fsp_flags(): Set the flags for page compression algorithm.
FIXME: Maybe there should be a table option page_compression_algorithm
and a session variable to back it?
Whenever we are reading the first page of a data file, we may have to
adjust the provisionally created fil_space_t::flags to match what is
actually inside the data files. In this way, we will never accidentally
change the format of a data file.
fil_node_t::read_page0(): After validating the FIL_SPACE_FLAGS,
always assign them to space->flags.
btr_root_adjust_on_import(), Datafile::validate_to_dd(),
fil_space_for_table_exists_in_mem(): Adapt to the fix
in fil_node_t::read_page0().
fsp_flags_try_adjust(): Skip the adjustment if full_crc32 is being
used. This adjustment was introduced in MDEV-11623 for upgrading
from MariaDB 10.1.0 to 10.1.20, which used an accidentally changed
format of FIL_SPACE_FLAGS. MariaDB before 10.4.3 never set the
flag that now indicates the full_crc32 format.
* MDEV-16509 Improve wsrep commit performance with binlog disabled
Release the commit order critical section early after trx_commit_low() if
the binlog is not the transaction coordinator. In order to avoid two-phase
commit, binlog_hton is not registered for the THD during IO_CACHE population.
Implemented a test which verifies that the transactions release
commit order early.
This optimization changes behavior during recovery, as the commit
is not two-phase when the binlog is off. Fixed and recorded wsrep-recover-v25
and wsrep-recover to match the behavior.
* MDEV-18730 Ordering for wsrep binlog group commit
Previously, out-of-order execution was allowed for wsrep commits.
Established proper ordering by populating wait_for_commit
for every wsrep THD and making the group commit leader wait for
prior commits before proceeding to trx_group_commit_leader().
* MDEV-18730 Added a test case to verify correct commit ordering
* MDEV-16509, MDEV-18730 Review fixes
Use WSREP_EMULATE_BINLOG() macro to decide if the binlog_hton
should be registered. Whitespace/syntax fixes and cleanups.
* MDEV-16509 Require binlog for galera_var_innodb_disallow_writes test
If the commit to InnoDB is done in one phase, the native InnoDB behavior
is that the transaction is committed in memory before it is persisted to
disk. This means that innodb_disallow_writes=ON may not prevent the
transaction from becoming visible to other readers before the commit is
completely over. On the other hand, if the commit is two-phase (as it is
with binlog), the transaction will be blocked in the prepare phase.
Fixed the test to use binlog, which enforces two-phase commit, which
in turn makes the commit block before the changes become visible to
other connections. This guarantees that the test produces the expected
result.
The warning was removed as this is a common case that happens if the table
was dropped and later created during the same checkpoint or if there was
a bulk insert done on an empty table.
In 10.3, all records will be processed by purge due to MDEV-12288.
But, the insert undo records do not contain a transaction identifier.
row_purge_parse_undo_rec(): Use node->trx_id=TRX_ID_MAX for the
insert undo records. We cannot skip table lookups for these records
after DISCARD TABLESPACE other than by 'detaching' the table from
the undo logs by updating SYS_TABLES.ID on both DISCARD TABLESPACE
and IMPORT TABLESPACE.
Also, remove a redundant condition that was introduced
in the merge commit 814205f306.
recv_parse_log_recs(): Do not compare type if ptr==end_ptr
(we have reached the end of the redo log parsing buffer),
because it will not have been correctly initialized in that case.
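A minimal sketch of the guard (the surrounding code is hypothetical;
only the ptr == end_ptr condition comes from this fix):

    /* ptr is the current position in the redo log parsing buffer,
    end_ptr its end. */
    if (ptr == end_ptr) {
        /* The record is truncated at the buffer end: 'type' was
        never parsed, so it must not be examined here. */
        return(0);
    }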
GCC 6 and later can optimize away the memset() that is part of
mem_heap_zalloc() in a placement new call. So, instead of relying
on that kind of initialization, explicitly initialize the necessary
fields in the constructors.
que_common_t::que_common_t(): Initialize more fields in the
default constructor.
purge_vcol_info_t::purge_vcol_info_t(): Initialize all fields in
the default constructor.
purge_node_t::purge_node_t(): Initialize all necessary fields.
Reference:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71388
https://gcc.gnu.org/ml/gcc/2016-02/msg00207.html
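A minimal sketch of the hazard (hypothetical type; this is not the
actual InnoDB code):

    #include <string.h>
    #include <new>

    struct node_t {
        int state;
        node_t() {}  /* does not initialize 'state' */
    };

    node_t* create(void* buf)
    {
        memset(buf, 0, sizeof(node_t)); /* GCC 6+ may elide this store... */
        return new (buf) node_t();      /* ...because the object's lifetime
                                        only begins here */
    }

The fix applied here is the equivalent of writing
node_t() : state(0) {} instead of relying on the prior memset().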
row_merge_create_fts_sort_index(): Initialize dict_col_t in
an unambiguous way. GCC 6 and later appear to be able to optimize
away the memset() that is part of mem_heap_zalloc() in the
placement new call. Let us avoid using placement new in order
to ensure that the objects will actually be initialized.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71388
https://gcc.gnu.org/ml/gcc/2016-02/msg00207.html
While the latter reference hints that the optimization is only
applicable to non-POD types (and dict_col_t does not define
any member functions before 10.2), it is most consistent to
use the same initialization across all versions.
This was caused by a combination of factors:
* MyISAM/Aria temporary tables historically never saved the state
to disk (MYI/MAI), because the state never needed to persist
* certain ALTER TABLE operations modify the original TABLE structure
and if they fail, the original table has to be reopened to
revert all changes (m_needs_reopen=1)
As a result, when ALTER fails and the MyISAM/Aria temp table gets
reopened, it reads the stale state from disk.
As a fix, MyISAM/Aria tables now *always* write the state to disk
on close, *unless* HA_EXTRA_PREPARE_FOR_DROP was done first. And
the server now always does HA_EXTRA_PREPARE_FOR_DROP before dropping
a temporary table.
purge_node_t::in_progress: Replaces purge_node_t::done.
Only present in debug builds.
purge_node_t::start(): Moved from the start of row_purge_step().
purge_node_t::end(): Replaces row_purge_end().
trx_purge_attach_undo_recs(): Omit a check from non-debug builds.
If a table has been dropped, rebuilt, or its tablespace has been
discarded or the table is corrupted, it does not make sense to
look up that table again while purging old undo log records.
purge_node_t::purge_node_t(): Replaces row_purge_node_create().
que_common_t::que_common_t(): Constructor.
row_import_update_index_root(): Remove the constant parameter
dict_locked=true, and update the table->def_trx_id in the cache.
purge_node_t::unavailable_table_id: The latest unavailable table ID,
to avoid future lookups.
purge_node_t::def_trx_id: The latest modification of the table
identified by unavailable_table_id, or TRX_ID_MAX.
purge_node_t::is_skipped(): Determine if a table should be skipped.
purge_node_t::skip(): Note that a table should be skipped.
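A minimal sketch of this skip cache (the member names are from the
description above; the method signatures and TRX_ID_MAX definition are
hypothetical):

    typedef unsigned long long table_id_t;
    typedef unsigned long long trx_id_t;
    static const trx_id_t TRX_ID_MAX = ~0ULL;

    struct purge_node_t {
        /* the latest unavailable table ID, to avoid future lookups */
        table_id_t unavailable_table_id;
        /* the latest modification of that table, or TRX_ID_MAX */
        trx_id_t def_trx_id;

        purge_node_t() : unavailable_table_id(0), def_trx_id(TRX_ID_MAX) {}

        /* note that table 'id' should be skipped for undo records
        written up to 'limit' */
        void skip(table_id_t id, trx_id_t limit)
        {
            unavailable_table_id = id;
            def_trx_id = limit;
        }

        /* determine if an undo record for table 'id' written by
        transaction 'trx_id' should be skipped */
        bool is_skipped(table_id_t id, trx_id_t trx_id) const
        {
            return unavailable_table_id == id && trx_id <= def_trx_id;
        }
    };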
In commit d30f17af49 the change of
the loop iteration broke another error handling path that did
"goto error_handling_drop_uncached". Cover this code path with
fault injection, and revert to the correct iteration.
There were two fault injection labels named innodb_OOM_prepare_inplace_alter.
Their order was swapped in MDEV-11369, so that the label that used
to be covered in an ADD INDEX code path became unreachable,
because the label that is executed for any ALTER TABLE was executed
first. Let us introduce the label innodb_OOM_prepare_add_index
for the more specific case.
If create_index_dict() fails, we need to free ctx->add_index[a] too.
This fixes innodb.alter_crash and innodb.instant_alter_debug
failures in ASAN_OPTIONS="abort_on_error=1" runs.
row_merge_create_index_graph(): Relay the internal state
from dict_create_index_step(). Our caller should free the index
only if it was not copied, added to the cache, and freed.
row_merge_create_index(): Free the index template if it was
not added to the cache. This is a safer variant of the logic
that was introduced in 65070beffd in 10.2.
prepare_inplace_alter_table_dict(): Add additional fault injection
to exercise a code path where we have already added an index
to the cache.
row_mysql_handle_errors(): Correct the wrong error handling for
the code DB_FOREIGN_EXCEED_MAX_CASCADE that was introduced in
c0923d396a
commit 35f5429eda
Author: Jimmy Yang <jimmy.yang@oracle.com>
Date: Wed Oct 6 06:55:34 2010 -0700
Manual port Bug #54582 "stack overflow when opening many tables
linked with foreign keys at once" from mysql-5.1-security to
mysql-5.5-security again.
rb://391 approved by Heikki
No known test case exists for repeating the bug before MariaDB 10.2.
The scenario should be that DB_FOREIGN_EXCEED_MAX_CASCADE is returned,
then InnoDB wrongly skips the rollback to the start of the current
row operation, and finally the SQL layer commits the transaction.
Normally the SQL layer would roll back either the entire transaction or
to the start of the statement. In the faulty scenario, InnoDB would
leave the transaction in an inconsistent state, and the SQL layer could
commit the transaction.
Disable inplace ALTER for adding stored generated columns.
This fixes mroonga/storage.column_generated_stored_add_column failures
in ASAN_OPTIONS="abort_on_error=1" runs.
Also, add a test case that shows the bug without ASAN.
I know of no test case for this bug in 10.1, so a test case will be
committed separately in 10.2.
fts_reset_get_doc(): properly initialize fts_get_doc_t::cache
fts_fetch_index_words(): Restore the initialization len=0.
The test innodb_fts.create in 10.2 would end up in an infinite loop
if this assignment is removed, because a following iteration of the
while() loop would assign zip->zp->avail_in=len with the original value
instead of the 0 that was reset in the previous iteration.
Fix the warnings issued by GCC 8 -Wstringop-truncation
and -Wstringop-overflow in InnoDB and XtraDB.
This work is motivated by Jan Lindström. The patch mainly differs
from his original one as follows:
(1) We remove explicit initialization of stack-allocated string buffers.
The minimum amount of initialization that is needed is a terminating
NUL character.
(2) GCC issues a warning for invoking strncpy(dest, src, sizeof dest),
because if strlen(src) >= sizeof dest, there would be no terminating
NUL byte in dest. We avoid this problem by invoking strncpy() with
a limit that is 1 less than the buffer size, and by always writing
NUL to the last byte of the buffer (see the sketch after this list).
(3) We replace strncpy() with memcpy() or strcpy() in those cases
when the result is functionally equivalent.
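A minimal sketch of the pattern in item (2) (the buffer size and
function name are hypothetical):

    #include <string.h>

    void copy_name(char (&dest)[64], const char* src)
    {
        /* Copy at most sizeof dest - 1 bytes and NUL-terminate
        explicitly, so the result is always a valid C string and
        GCC 8 -Wstringop-truncation recognizes the idiom. */
        strncpy(dest, src, sizeof dest - 1);
        dest[sizeof dest - 1] = '\0';
    }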
Note: fts_fetch_index_words() never deals with len==UNIV_SQL_NULL.
This was enforced by an assertion that limits the maximum length
to FTS_MAX_WORD_LEN. Also, the encoding that InnoDB uses for
the compressed fulltext index is not byte-order agnostic, that is,
InnoDB data files that use FULLTEXT INDEX are not portable between
big-endian and little-endian systems.
row_merge_create_fts_sort_index(): Initialize dict_col_t.
This fixes an access to uninitialized dict_col_t::ind when a debug
assertion in MariaDB 10.4 invokes is_dropped() in
rec_get_converted_size_comp_prefix_low(). Older MariaDB versions
seem to be unaffected by the uninitialized values, but it should
not hurt to initialize everything.
Only starting with MariaDB 10.3.8 (MDEV-16365), InnoDB can actually
handle ALTER IGNORE TABLE correctly when introducing a NOT NULL
attribute to a column that contains a NULL value. Between
MariaDB Server 10.0 and 10.2, we would incorrectly return an error
for ALTER IGNORE TABLE when the column contains a NULL value.
Don't do anything special for stored generated columns
in MyISAM repair code.
Add an assert that if there are virtual indexed columns, they
_must_ be beyond the file->s->base.reclength boundary.
On an error (such as when an index cannot be dropped due to
FOREIGN KEY constraints), the field dict_index_t::to_be_dropped
was only being cleared in debug builds, even though the field
is available and being used also in non-debug builds.
This was a regression that was introduced by myself originally
in MySQL 5.7.6 and later merged to MariaDB 10.2.2, in
d39898de8e
An error manifested itself in the MariaDB Server 10.4 non-debug build,
involving instant ADD or DROP column. Because an earlier failed
ALTER TABLE operation incorrectly left the dict_index_t::to_be_dropped
flag set, the column pointers of the index fields would fail to be
adjusted for instant ADD or DROP column (MDEV-15562). The instant
ADD COLUMN in MariaDB Server 10.3 is unlikely to be affected by a
similar scenario, because dict_table_t::instant_add_column() in 10.3
is applying the transformations to all indexes, not skipping
to-be-dropped ones.
The problem with the InnoDB table attribute encryption_key_id is that it is
not persisted anywhere in InnoDB unless the table attribute
encryption is specified as something other than encryption=default.
MDEV-17320 made it a hard error if encryption_key_id is specified as
anything other than 1 in that case.
Ideally, we would always persist encryption_key_id in InnoDB. But then we
would have to be prepared for the case that encryption is being enabled
for a table whose encryption_key_id attribute refers to a non-existent key.
In MariaDB Server 10.1, our best option remains to not store anything
inside InnoDB. But, instead of returning the error that MDEV-17320
introduced, we should merely issue a warning that the specified
encryption_key_id is going to be ignored if encryption=default.
To improve the situation a little more, we will issue a warning if
SET [GLOBAL|SESSION] innodb_default_encryption_key_id is being set
to something that does not refer to an available encryption key.
Starting with MariaDB Server 10.2, thanks to MDEV-5800, we could open the
table definition from InnoDB side when the encryption is being enabled,
and actually fix the root cause of what was reported in MDEV-17320.