Commit graph

735 commits

Author SHA1 Message Date
Marko Mäkelä
447b8ba164 Merge 10.2 into 10.3 2019-04-29 17:54:10 +03:00
Marko Mäkelä
5fb4c0a8a8 Clean up ut_list
ut_list_validate(), ut_list_map(): Add variants with const Functor&
so that these functions can be called with an rvalue.

Remove wrapper macros, and add #ifdef UNIV_DEBUG around debug-only code.
2019-04-29 15:11:06 +03:00
Marko Mäkelä
1b577e4d8b MDEV-19356 Assertion 'space->free_limit == 0 || space->free_limit == free_limit'
fil_node_t::read_page0(): Do not replace up-to-date metadata with a
possibly old version of page 0 that is being reread from the file.
A more up-to-date page 0 could still exists in the buffer pool,
waiting to be written back to the file.
2019-04-29 11:43:22 +03:00
Marko Mäkelä
4d59f45260 Merge 10.2 into 10.3 2019-04-27 20:41:31 +03:00
Eugene Kosov
6c5c1f0b2f MDEV-19231 make DB_SUCCESS equal to 0
It's a micro optimization. On most platforms CPUs has instructions to
compare with 0 fast. DB_SUCCESS is the most popular outcome of functions
and this patch optimized code like (err == DB_SUCCESS)

BtrBulk::finish(): bogus assertion fixed

fil_node_t::read_page0(): corrected usage of os_file_read()

que_eval_sql(): bugus assertion removed. Apparently it checked that
the field was assigned after having been zero-initialized at
object creation.

It turns out that the return type of os_file_read_func() was changed
in mysql/mysql-server@98909cefbc (MySQL 5.7)
from ibool to dberr_t. The reviewer (if there was any) failed to
point out that because of future merges, it could be a bad idea to
change the return type of a function without changing the function name.

This change was applied to MariaDB 10.2.2 in
commit 2e814d4702 but the
MariaDB-specific code was not fully adjusted accordingly,
e.g. in fil_node_open_file(). Essentially, code like
!os_file_read(...) became dead code in MariaDB and later
in Mariabackup 10.2, and we could be dealing with an uninitialized
buffer after a failed page read.
2019-04-25 16:29:55 +03:00
Marko Mäkelä
d8303c3ee7 Merge 10.3 into 10.4 2019-04-08 08:22:34 +03:00
Marko Mäkelä
7f5849a809 MDEV-18309: Remove unused code 2019-04-07 12:05:12 +03:00
Marko Mäkelä
cc492bfd4f Merge 10.2 into 10.3 2019-04-07 11:49:50 +03:00
Marko Mäkelä
867617a976 MDEV-18309: InnoDB reports bogus errors about missing #sql-*.ibd on startup
This is a follow-up to MDEV-18733. As part of that fix, we made
dict_check_sys_tables() skip tables that would be dropped by
row_mysql_drop_garbage_tables().

DICT_ERR_IGNORE_DROP: A new mode where the file should not be attempted
to be opened.

dict_load_tablespace(): Do not try to load the tablespace if
DICT_ERR_IGNORE_DROP has been specified.

row_mysql_drop_garbage_tables(): Pass the DICT_ERR_IGNORE_DROP mode.

fil_space_for_table_exists_in_mem(): Remove a parameter.
The only caller that passed print_error_if_does_not_exist=true
was row_drop_single_table_tablespace().
2019-04-07 10:57:38 +03:00
Marko Mäkelä
71f9552fd8 recv_recovery_is_on(): Add UNIV_UNLIKELY
Normally, InnoDB is not in the process of executing crash recovery.
Provide a hint to the compiler that the recovery-related code paths
are rarely executed.
2019-04-06 21:25:43 +03:00
Marko Mäkelä
10dd290b4b MDEV-17380 innodb_flush_neighbors=ON should be ignored on SSD
For tablespaces that do not reside on spinning storage, it does
not make sense to attempt to write nearby pages when writing out
dirty pages from the InnoDB buffer pool. It is actually detrimental
to performance and to the life span of flash ROM storage.

With this change, MariaDB will detect whether an InnoDB file resides
on solid-state storage. The detection has been implemented for Linux
and Microsoft Windows. For other systems, we will err on the safe side
and assume that files reside on SSD.

As part of this change, we will reduce the number of fstat() calls
when opening data files on POSIX systems and slightly clean up some
file I/O code.

FIXME: os_is_sparse_file_supported() on POSIX works in a destructive
manner. Thus, we can only invoke it when creating files, not when
opening them.

For diagnostics, we introduce the column ON_SSD to the table
INFORMATION_SCHEMA.INNODB_TABLESPACES_SCRUBBING. The table
INNODB_SYS_TABLESPACES might seem more appropriate, but its purpose
is to reflect the contents of the InnoDB system table SYS_TABLESPACES,
which we would like to remove at some point.

On Microsoft Windows, querying StorageDeviceSeekPenaltyProperty
sometimes returns ERROR_GEN_FAILURE instead of ERROR_INVALID_FUNCTION
or ERROR_NOT_SUPPORTED. We will silently ignore also this error,
and assume that the file does not reside on SSD.

On Linux, the detection will be based on the files
/sys/block/*/queue/rotational and /sys/block/*/dev.
Especially for USB storage, it is possible that
/sys/block/*/queue/rotational will wrongly report 1 instead of 0.

fil_node_t::on_ssd: Whether the InnoDB data file resides on
solid-state storage.

fil_system_t::ssd: Collection of Linux block devices that reside on
non-rotational storage.

fil_system_t::create(): Detect ssd on Linux based on the contents
of /sys/block/*/queue/rotational and /sys/block/*/dev.

fil_system_t::is_ssd(dev_t): Determine if a Linux block device is
non-rotational. Partitions will be identified with the containing
block device by assuming that the least significant 4 bits of the
minor number identify a partition, and that the "partition number"
of the entire device is 0.
2019-04-01 12:00:56 +03:00
Sergei Golubchik
f97d879bf8 cmake: re-enable -Werror in the maintainer mode
now we can afford it. Fix -Werror errors. Note:
* old gcc is bad at detecting uninit variables, disable it.
* time_t is int or long, cast it for printf's
2019-03-27 22:51:37 +01:00
Marko Mäkelä
6b6fa3cdb1 MDEV-18644: Support full_crc32 for page_compressed
This is a follow-up task to MDEV-12026, which introduced
innodb_checksum_algorithm=full_crc32 and a simpler page format.
MDEV-12026 did not enable full_crc32 for page_compressed tables,
which we will be doing now.

This is joint work with Thirunarayanan Balathandayuthapani.

For innodb_checksum_algorithm=full_crc32 we change the
page_compressed format as follows:

FIL_PAGE_TYPE: The most significant bit will be set to indicate
page_compressed format. The least significant bits will contain
the compressed page size, rounded up to a multiple of 256 bytes.

The checksum will be stored in the last 4 bytes of the page
(whether it is the full page or a page_compressed page whose
size is determined by FIL_PAGE_TYPE), covering all preceding
bytes of the page. If encryption is used, then the page will
be encrypted between compression and computing the checksum.
For page_compressed, FIL_PAGE_LSN will not be repeated at
the end of the page.

FSP_SPACE_FLAGS (already implemented as part of MDEV-12026):
We will store the innodb_compression_algorithm that may be used
to compress pages. Previously, the choice of algorithm was written
to each compressed data page separately, and one would be unable
to know in advance which compression algorithm(s) are used.

fil_space_t::full_crc32_page_compressed_len(): Determine if the
page_compressed algorithm of the tablespace needs to know the
exact length of the compressed data. If yes, we will reserve and
write an extra byte for this right before the checksum.

buf_page_is_compressed(): Determine if a page uses page_compressed
(in any innodb_checksum_algorithm).

fil_page_decompress(): Pass also fil_space_t::flags so that the
format can be determined.

buf_page_is_zeroes(): Check if a page is full of zero bytes.

buf_page_full_crc32_is_corrupted(): Renamed from
buf_encrypted_full_crc32_page_is_corrupted(). For full_crc32,
we always simply validate the checksum to the page contents,
while the physical page size is explicitly specified by an
unencrypted part of the page header.

buf_page_full_crc32_size(): Determine the size of a full_crc32 page.

buf_dblwr_check_page_lsn(): Make this a debug-only function, because
it involves potentially costly lookups of fil_space_t.

create_table_info_t::check_table_options(),
ha_innobase::check_if_supported_inplace_alter(): Do allow the creation
of SPATIAL INDEX with full_crc32 also when page_compressed is used.

commit_cache_norebuild(): Preserve the compression algorithm when
updating the page_compression_level.

dict_tf_to_fsp_flags(): Set the flags for page compression algorithm.
FIXME: Maybe there should be a table option page_compression_algorithm
and a session variable to back it?
2019-03-18 14:08:43 +02:00
Marko Mäkelä
2151aed44d Follow-up fix to MDEV-12026: FIL_SPACE_FLAGS trump fil_space_t::flags
Whenever we are reading the first page of a data file, we may have to
adjust the provisionally created fil_space_t::flags to match what is
actually inside the data files. In this way, we will never accidentally
change the format of a data file.

fil_node_t::read_page0(): After validating the FIL_SPACE_FLAGS,
always assign them to space->flags.

btr_root_adjust_on_import(), Datafile::validate_to_dd(),
fil_space_for_table_exists_in_mem(): Adapt to the fix
in fil_node_t::read_page0().

fsp_flags_try_adjust(): Skip the adjustment if full_crc32 is being
used. This adjustment was introduced in MDEV-11623 for upgrading
from MariaDB 10.1.0 to 10.1.20, which used an accidentally changed
format of FIL_SPACE_FLAGS. MariaDB before 10.4.3 never set the
flag that now indicates the full_crc32 format.
2019-03-18 13:10:28 +02:00
Thirunarayanan Balathandayuthapani
c0f47a4a58 MDEV-12026: Implement innodb_checksum_algorithm=full_crc32
MariaDB data-at-rest encryption (innodb_encrypt_tables)
had repurposed the same unused data field that was repurposed
in MySQL 5.7 (and MariaDB 10.2) for the Split Sequence Number (SSN)
field of SPATIAL INDEX. Because of this, MariaDB was unable to
support encryption on SPATIAL INDEX pages.

Furthermore, InnoDB page checksums skipped some bytes, and there
are multiple variations and checksum algorithms. By default,
InnoDB accepts all variations of all algorithms that ever existed.
This unnecessarily weakens the page checksums.

We hereby introduce two more innodb_checksum_algorithm variants
(full_crc32, strict_full_crc32) that are special in a way:
When either setting is active, newly created data files will
carry a flag (fil_space_t::full_crc32()) that indicates that
all pages of the file will use a full CRC-32C checksum over the
entire page contents (excluding the bytes where the checksum
is stored, at the very end of the page). Such files will always
use that checksum, no matter what the parameter
innodb_checksum_algorithm is assigned to.

For old files, the old checksum algorithms will continue to be
used. The value strict_full_crc32 will be equivalent to strict_crc32
and the value full_crc32 will be equivalent to crc32.

ROW_FORMAT=COMPRESSED tables will only use the old format.
These tables do not support new features, such as larger
innodb_page_size or instant ADD/DROP COLUMN. They may be
deprecated in the future. We do not want an unnecessary
file format change for them.

The new full_crc32() format also cleans up the MariaDB tablespace
flags. We will reserve flags to store the page_compressed
compression algorithm, and to store the compressed payload length,
so that checksum can be computed over the compressed (and
possibly encrypted) stream and can be validated without
decrypting or decompressing the page.

In the full_crc32 format, there no longer are separate before-encryption
and after-encryption checksums for pages. The single checksum is
computed on the page contents that is written to the file.

We do not make the new algorithm the default for two reasons.
First, MariaDB 10.4.2 was a beta release, and the default values
of parameters should not change after beta. Second, we did not
yet implement the full_crc32 format for page_compressed pages.
This will be fixed in MDEV-18644.

This is joint work with Marko Mäkelä.
2019-02-19 18:50:19 +02:00
Marko Mäkelä
0a1c3477bf MDEV-18493 Remove page_size_t
MySQL 5.7 introduced the class page_size_t and increased the size of
buffer pool page descriptors by introducing this object to them.

Maybe the intention of this exercise was to prepare for a future
where the buffer pool could accommodate multiple page sizes.
But that future never arrived, not even in MySQL 8.0. It is much
easier to manage a pool of a single page size, and typically all
storage devices of an InnoDB instance benefit from using the same
page size.

Let us remove page_size_t from MariaDB Server. This will make it
easier to remove support for ROW_FORMAT=COMPRESSED (or make it a
compile-time option) in the future, just by removing various
occurrences of zip_size.
2019-02-07 12:21:35 +02:00
Marko Mäkelä
e80bcd7f64 Merge 10.3 into 10.4 2019-02-05 12:48:02 +02:00
Marko Mäkelä
ab2458c61f Merge 10.2 into 10.3 2019-02-04 15:12:14 +02:00
Thirunarayanan Balathandayuthapani
7c7161a1bd MDEV-18194 Incremental prepare tries to access page which is out of tablespace bounds
Problem:
=======
Mariabackup incremental prepare creates new tablespace when it encounter
new tablespace. It sets the intial size as FIL_IBD_FILE_INITIAL_SIZE (4).
But while applying redo log, it tries to access 5th page and then
it leads to out of tablespace error.

Fix:
===
While parsing the redo log record, track FSP_SIZE in recv_spaces for the
respective space id. Assign the recv_size for the tablespace when it
is loaded. Extend the tablespace depends on recv_size while applying
the redo log record.
2019-02-01 09:15:53 +02:00
Sergei Golubchik
9b76e2843b Merge branch '10.3' into 10.4 2019-01-26 01:13:41 +01:00
Marko Mäkelä
9bd80ada6f Merge 10.2 into 10.3 2019-01-25 16:35:13 +02:00
Eugene Kosov
31d0727a10 MDEV-18235: Changes related to fsync()
Remove fil_node_t::sync_event.

I had a discussion with kernel fellows and they said it's safe to call
fsync() simultaneously at least on VFS and ext4. So initially I wanted
to disable check for recent Linux but than I realized code is buggy.

Consider a case when one thread is inside fsync() and two others are
waiting inside os_event. First thread after fsync() calls os_event_set()
which is a broadcast! So two waiting threads will awake and may call
fsync() at the same time.

One fix is to add a notify_one() functionality to os_event but I decided
to remove incorrect check completely. Note, it works for one waiting
thread but not for more than one.

IMO it's ok to avoid existing bugs but there is not too much sense in
avoiding possible(!) bugs as this code does.

fil_space_t::is_in_rotation_list(), fil_space_t::is_in_unflushed_spaces():
Replace redundant bool fields with member functions.

fil_node_t::needs_flush: Replaces fil_node_t::modification_counter and
fil_node_t::flush_counter. We need to know whether there _are_ some
unflushed writes and we do not need to know _how many_ writes.

fil_system_t::modification_counter: Remove as not needed.
Even if we needed fil_node_t::modification_counter, every file
could have its own counter that would be incremented on each write.

fil_system_t::modification_counter is a global modification counter
for all files. It was incremented on every write. But whether some
file was flushed or not is an internal fil_node_t deal/state and
this makes fil_system_t::modification_counter useless.

Closes #1061
2019-01-25 15:40:04 +02:00
Marko Mäkelä
78829a5780 Merge 10.3 into 10.4 2019-01-24 22:42:35 +02:00
Marko Mäkelä
947b6b849d Merge 10.2 into 10.3 2019-01-24 16:14:12 +02:00
Marko Mäkelä
9a7281a703 Merge 10.1 into 10.2 2019-01-23 15:09:06 +02:00
Marko Mäkelä
3b6d2efcb1 Merge 10.0 into 10.1 2019-01-23 14:34:23 +02:00
Marko Mäkelä
31d592ba7d MDEV-18349 InnoDB file size changes are not safe when file system crashes
When InnoDB is invoking posix_fallocate() to extend data files, it
was missing a call to fsync() to update the file system metadata.
If file system recovery is needed, the file size could be incorrect.

When the setting innodb_flush_method=O_DIRECT_NO_FSYNC
that was introduced in MariaDB 10.0.11 (and MySQL 5.6) is enabled,
InnoDB would wrongly skip fsync() after extending files.

Furthermore, the merge commit d8b45b0c00
inadvertently removed XtraDB error checking for posix_fallocate()
which this fix is restoring.

fil_flush(): Add the parameter bool metadata=false to request that
fil_buffering_disabled() be ignored.

fil_extend_space_to_desired_size(): Invoke fil_flush() with the
extra parameter. After successful posix_fallocate(), invoke
os_file_flush(). Note: The bookkeeping for fil_flush() would not be
updated the posix_fallocate() code path, so the "redundant"
fil_flush() should be a no-op.
2019-01-23 12:18:00 +02:00
Marko Mäkelä
942a6bd009 MDEV-18349 InnoDB file size changes are not safe when file system crashes
fil_extend_space_to_desired_size(): Invoke fsync() after posix_fallocate()
in order to durably extend the file in a crash-safe file system.
2019-01-23 09:51:06 +02:00
Marko Mäkelä
9dc81d2a7a Merge 10.3 into 10.4 2019-01-17 13:21:27 +02:00
Marko Mäkelä
77cbaa96ad Merge 10.2 into 10.3 2019-01-17 12:38:46 +02:00
Marko Mäkelä
8e80fd6bfd Merge 10.1 into 10.2 2019-01-17 11:24:38 +02:00
Thirunarayanan Balathandayuthapani
ad376a05fa MDEV-18279 MLOG_FILE_WRITE_CRYPT_DATA fails to set the crypt_data->type for tablespace.
Problem:
========
MLOG_FILE_WRITE_CRYPT_DATA redo log fails to apply type for
the crypt_data present in the space. While processing the double-write
buffer pages, page fails to decrypt. It leads to warning message.

Fix:
====
Set the type while parsing MLOG_FILE_WRITE_CRYPT_DATA redo log.
If type and length is of invalid type then mark it as corrupted.
2019-01-17 13:10:45 +05:30
Thirunarayanan Balathandayuthapani
038785e1f8 MDEV-18183 Server startup fails while dropping garbage encrypted tablespace
- There is no need to wait for crypt thread to stop accessing space while
dropping the garbage encrypted tablespace during recover.
2019-01-16 23:02:39 +05:30
Marko Mäkelä
734510a44d Merge 10.3 into 10.4 2019-01-06 17:43:02 +02:00
Sergei Golubchik
6bb11efa4a Merge branch '10.2' into 10.3 2019-01-03 13:09:41 +01:00
Sergey Vojtovich
9e37537c4e MDEV-17441 - InnoDB transition to C++11 atomics
Trivial fil_space_t::n_pending_ops transition. Since it is not
obvious which memory barriers are supposed to be issued, seq_cst
memory order was preserved.
2018-12-29 14:09:33 +04:00
Sergey Vojtovich
03e461634d MDEV-17441 - InnoDB transition to C++11 atomics
fil_validate_count transition to Atomic_counter.
2018-12-27 22:46:38 +04:00
Marko Mäkelä
b7a9563b21 Merge 10.1 into 10.2 2018-12-21 09:43:35 +02:00
Marko Mäkelä
ed36fc353f MDEV-18025: Detect corrupted innodb_page_compression=zlib pages
In MDEV-13103, I made a mistake in the error handling of
page_compressed=1 decryption when the default
innodb_compression_algorithm=zlib is used.
Due to this mistake, with certain versions of zlib,
MariaDB would fail to detect a corrupted page.

The problem was uncovered by the following tests:
mariabackup.unencrypted_page_compressed
mariabackup.encrypted_page_compressed
2018-12-20 13:33:09 +02:00
Marko Mäkelä
977073e3dc After-merge fix 2018-12-18 12:38:38 +02:00
Marko Mäkelä
1b3770acda Merge 10.3 into 10.4 2018-12-18 11:53:31 +02:00
Marko Mäkelä
75e7e0b90f Merge 10.2 into 10.3 2018-12-18 11:51:44 +02:00
Marko Mäkelä
b5763ecd01 Merge 10.3 into 10.4 2018-12-18 11:33:53 +02:00
Marko Mäkelä
0032170940 Merge 10.1 into 10.2 2018-12-18 10:01:15 +02:00
Marko Mäkelä
84f119f25c MDEV-12112/MDEV-12114: Relax strict_innodb, strict_none
Starting with commit 6b6987154a
the encrypted page checksum is only computed with crc32.
Encrypted data pages that were written earlier can contain
other checksums, but new ones will only contain crc32.

Because of this, it does not make sense to implement strict
checks of innodb_checksum_algorithm for other than strict_crc32.

fil_space_verify_crypt_checksum(): Treat strict_innodb as innodb
and strict_none as none. That is, allow a match from any of the
algorithms none, innodb, crc32. (This is how it worked before the
second MDEV-12112 fix.)

Thanks to Thirunarayanan Balathandayuthapani for pointing this out.
2018-12-18 09:52:28 +02:00
Marko Mäkelä
45531949ae Merge 10.2 into 10.3 2018-12-18 09:15:41 +02:00
Marko Mäkelä
ed13a0d221 MDEV-12112: Support WITH_INNODB_BUG_ENDIAN_CRC32
fil_space_verify_crypt_checksum(): Compute the bug-compatible variant
of the CRC-32C checksum if the correct one does not match.
2018-12-17 22:45:21 +02:00
Marko Mäkelä
fae7e350a8 Merge 10.1 into 10.2 2018-12-17 22:35:32 +02:00
Marko Mäkelä
51a1fc737c Fix a compiler warning
fil_space_verify_crypt_checksum(): Add a dummy return statement
in case memory is corrupted and innodb_checksum_algorithm has
an invalid value.
2018-12-17 22:35:22 +02:00
Marko Mäkelä
7d245083a4 Merge 10.1 into 10.2 2018-12-17 20:15:38 +02:00