mariadb

mirror of https://github.com/MariaDB/server.git synced 2025-01-31 02:51:44 +01:00

Author	SHA1	Message	Date
Marko Mäkelä	0a573e7e63	Merge 10.5 into 10.6	2022-03-29 19:49:29 +03:00
Marko Mäkelä	7d7bdd4aaa	MDEV-28185 InnoDB generates redundant log checkpoints The comparison on the checkpoint age (number of log bytes written since the previous checkpoint) is inaccurate, because the previous FILE_CHECKPOINT record could span two 512-byte log blocks, which will cause the LSN to increase by the size of the log block header and footer. We will still generate a redundant checkpoint if the previous checkpoint wrote some FILE_MODIFY records before the FILE_CHECKPOINT record.	2022-03-29 19:42:10 +03:00
Marko Mäkelä	b242c3141f	Merge 10.5 into 10.6	2022-03-29 16:16:21 +03:00
Marko Mäkelä	42609c240d	Cleanup: Replace log_sys.n_pending_checkpoint_writes with a Boolean Only one checkpoint may be in progress at a time. The counter log_sys.n_pending_checkpoint_writes was being protected by log_sys.mutex. Let us replace it with the Boolean log_sys.checkpoint_pending.	2022-03-29 14:56:44 +03:00
Marko Mäkelä	c14f60a72f	Fix g++-12 -O2 -Wstringop-overflow buf_pool_t::watch_unset(): Reorder some code so that no warning will be emitted in CMAKE_BUILD_TYPE=RelWithDebInfo. It is unclear why invoking watch_is_sentinel() before buf_fix_count() would make the warning disappear.	2022-03-29 12:59:38 +03:00
Marko Mäkelä	d62b0368ca	Merge 10.4 into 10.5	2022-03-29 12:59:18 +03:00
Marko Mäkelä	ae6e214fd8	Merge 10.3 into 10.4	2022-03-29 11:13:18 +03:00
Marko Mäkelä	020e7d89eb	Merge 10.2 into 10.3	2022-03-29 09:53:15 +03:00
Marko Mäkelä	303448bc91	MDEV-27931: buf_page_is_corrupted() wrongly claims corruption In commit `437da7bc54` (MDEV-19534), the default value of the global variable srv_checksum_algorithm in innochecksum was changed from SRV_CHECKSUM_ALGORITHM_INNODB to implied 0 (innodb_checksum_algorithm=crc32). As a result, the function buf_page_is_corrupted() would by default invoke buf_calc_page_crc32() in innochecksum, and crc32_inited would hold. This would cause "innochecksum" to fail on a particular page. The actual problem is older, introduced in 2011 in mysql/mysql-server@17e497bdb7 (MySQL 5.6.3). It should affect the validation of pages of old data files that were written with innodb_checksum_algorithm=innodb. When using innodb_checksum_algorithm=crc32 (the default setting since MariaDB Server 10.2), some valid pages would be rejected only because exactly one of the two checksum fields accidentally matches the innodb_checksum_algorithm=crc32 value. buf_page_is_corrupted(): Simplify the logic of non-strict checksum validation, by always invoking buf_calc_page_crc32(). Remove a bogus condition that if only one of the checksum fields contains the value returned by buf_calc_page_crc32(), the page is corrupted.	2022-03-28 13:36:36 +03:00
Marko Mäkelä	e9e6db9355	Fix g++-12 -O2 -Wstringop-overflow buf_pool_t::watch_unset(): Reorder some code so that no warning will be emitted in CMAKE_BUILD_TYPE=RelWithDebInfo. It is unclear why invoking watch_is_sentinel() before accessing the block descriptor state would make the warning disappear.	2022-03-25 09:23:16 +02:00
Daniel Black	88ce8a3d8b	Merge 10.7 into 10.8	2022-03-25 15:06:56 +11:00
Daniel Black	8b92e346b1	Merge 10.6 into 10.7	2022-03-25 14:31:59 +11:00
Marko Mäkelä	8684af76e3	MDEV-28137 Some memory transactions are unnecessarily complex buf_page_get_zip(): Do not perform a system call inside a memory transaction. Instead, if the page latch is unavailable, abort the memory transaction and let the fall-back code path wait for the page latch. buf_pool_t::watch_remove(): Return the previous state of the block. buf_page_init_for_read(): Use regular stores for moving the buffer fix count of watch_remove() to the new block descriptor. A more extensive version of this was reviewed by Daniel Black and tested with Intel TSX-NI by Axel Schwenke and Matthias Leich. My assumption that regular loads and stores would execute faster in a memory transaction than operations like std::atomic::fetch_add() turned out to be incorrect.	2022-03-24 16:09:04 +02:00
Marko Mäkelä	1ecf173741	Merge 10.8 into 10.9	2022-03-15 18:26:29 +02:00
Marko Mäkelä	9f5a3e5689	Merge 10.7 into 10.8	2022-03-15 18:18:07 +02:00
Marko Mäkelä	dc4b7f382b	Merge 10.6 into 10.7	2022-03-15 15:25:31 +02:00
Marko Mäkelä	4ef44cc2f9	Merge 10.5 into 10.6	2022-03-15 14:49:24 +02:00
Marko Mäkelä	73fee39ea6	MDEV-27985 buf_flush_freed_pages() causes InnoDB to hang buf_flush_freed_pages(): Assert that neither buf_pool.mutex nor buf_pool.flush_list_mutex are held. Simplify the loops. Return the tablespace and the number of pages written or punched. buf_flush_LRU_list_batch(), buf_do_flush_list_batch(): Release buf_pool.mutex before invoking buf_flush_space(). buf_flush_list_space(): Acquire the mutexes only after invoking buf_flush_freed_pages(). Reviewed by: Thirunarayanan Balathandayuthapani	2022-03-15 14:44:22 +02:00
Marko Mäkelä	8575d2fb39	MDEV-28043 Race condition between mtr_t::commit() and checkpoint In commit `a635c40648` (MDEV-27774) a race condition was introduced between mtr_t::commit() and a log checkpoint. Between the time of assigning the log sequence number and adding the changed pages to buf_pool.flush_list, the log_sys.latch must be continuously held by the current thread, or otherwise a log checkpoint could get the wrong result from buf_pool.get_oldest_modification(). buf_pool_t::insert_into_flush_list(): Add a debug assertion for increasing the probability of catching this type of problem. mtr_t::m_latch_ex: A flag that indicates whether the mini-transaction is holding log_sys.latch in exclusive mode. mtr_t::do_write(), mtr_t::finish_write(): Remove the parameter "bool ex" and refer to m_latch_ex instead. mtr_t::commit(): Release log_sys.latch according to m_latch_ex. mtr_t::commit_shrink(), mtr_t::commit_files(): Set m_latch_ex. mtr_t::do_write(): Do not release an exclusive log_sys.latch, but instead set m_latch_ex if needed.	2022-03-15 12:35:40 +02:00
Marko Mäkelä	18bb95b608	Merge 10.7 into 10.8	2022-03-14 11:52:11 +02:00
Marko Mäkelä	e67d46e4a1	Merge 10.6 into 10.7	2022-03-14 11:30:32 +02:00
Daniel Black	bd1ba7801f	Merge branch 10.5 into 10.6	2022-03-12 16:16:03 +11:00
Daniel Black	d78173828e	MDEV-27900: aio handle partial reads/writes As btrfs showed, a partial read of data in AIO /O_DIRECT circumstances can really confuse MariaDB. Filipe Manana (SuSE)[1] showed how database programmers can assume O_DIRECT is all or nothing. While a fix was done in the kernel side, we can do better in our code by requesting that the rest of the block be read/written synchronously if we do only get a partial read/write. Per the APIs, a partial read/write can occur before an error, so reattempting the request will leave the caller with a concrete error to handle. [1] https://lore.kernel.org/linux-btrfs/CABVffENfbsC6HjGbskRZGR2NvxbnQi17gAuW65eOM+QRzsr8Bg@mail.gmail.com/T/#mb2738e675e48e0e0778a2e8d1537dec5ec0d3d3a Also spell synchronously correctly in other files.	2022-03-12 09:47:53 +11:00
Marko Mäkelä	b95942a2a7	MDEV-27812: Fix a race condition, and pacify MemorySanitizer log_t::write_checkpoint(): Avoid a race condition with resize_abort() by loading log_sys.resize_lsn only while holding the locks. Also, remove an unnecessary write of all outstanding log before switching log files. log_t::write_buf(): Mark the entire block in the resize buffer as initialized, to allow the test innodb.log_file_size_online to pass when not using the mmap() based interface.	2022-03-11 11:10:09 +02:00
Marko Mäkelä	99e74478c8	Merge 10.8 into 10.9	2022-03-11 11:07:49 +02:00
Marko Mäkelä	1596ef738c	Merge 10.7 into 10.8	2022-03-11 10:49:49 +02:00
Marko Mäkelä	79bc654ac3	Merge 10.6 into 10.7	2022-03-11 10:48:58 +02:00
Marko Mäkelä	06ec439b8c	MDEV-27058 fixup: Relax a debug assertion buf_page_get_low(): Assert that the block not be read-fixed. It may be write-fixed while we only hold a shared latch on the page. Page writes are protected by U latches, which are compatible with S. In all other places where we assert that the block not be IO-fixed, we are holding U or X latch, which does prevent concurrent file I/O.	2022-03-10 15:23:28 +02:00
Marko Mäkelä	35fcae1040	Merge 10.8 into 10.9	2022-03-08 10:36:22 +02:00
Marko Mäkelä	e8a2a70cf8	Merge 10.7 into 10.8	2022-03-08 10:03:45 +02:00
Marko Mäkelä	af87186c1d	Merge 10.6 into 10.7	2022-03-08 09:51:31 +02:00
Daniel Black	b6a2472489	MDEV-27891: SIGSEGV in InnoDB buffer pool resize During an increase in resize, the new curr_size got a value less than old_size. As n_chunks_new and n_chunks have a strong correlation to the resizing operation in progress, we can use them and remove the need for old_size. For convenience, the n_chunks_new < n_chunks comparison is now the is_shrinking function. The volatile compiler optimization on n_chunks{,_new} is removed as real mutex uses are needed. Other n_chunks_new/n_chunks methods: n_chunks_new and n_chunks are almost always read and altered under the pool mutex. Exceptions are: * i_s_innodb_buffer_page_fill, * buf_pool_t::is_uncompressed (via is_blocked_field) These need reexamining for the need of a mutex, however, comments indicate this already. get_n_pages has uses in buffer pool load, recover log memory exhaustion estimates and innodb status so take the minimum number of chunks for safety. The buf_pool_t::running_out function also uses curr_size/old_size. We replace this hot function calculation with just n_chunks_new. This is the new size of the chunks before the resizing occurs. If we are resizing down, we've already got the case we had previously (as the minimum). If we are resizing upwards, we are taking an optimistic view that there will be buffer chunks available for locks. As this memory allocation is occurring immediately next to the resizing function it seems likely. Compiler hint UNIV_UNLIKELY removed to leave it to the branch predictor to make an informed decision. Added test case of a smaller size than the Marko/Roel original in JIRA reducing the size to 256M. SEGV hits roughly 1/10 times but it's better than a 21G memory size. Reviewer: Marko	2022-03-07 13:36:18 +11:00
Marko Mäkelä	177345dadc	MDEV-27812 Allow SET GLOBAL innodb_log_file_size We support online log resizing by replicating the current ib_logfile0 to a new file ib_logfile101, which will eventually replace the ib_logfile0 on the first applicable log checkpoint. Unless the log is located in a persistent memory file system (PMEM), an attempt to SET GLOBAL innodb_log_file_size to less than innodb_log_buffer_size will be refused. (With PMEM, a.k.a. mmap() based log, that parameter has no meaning.) Should the server be killed while the log was being resized, both files ib_logfile0 and ib_logfile101 may exist on startup, and since commit `3b06415cb8` the extra file ib_logfile101 will be removed. We will initiate checkpoint flushing by invoking buf_flush_ahead(), to let buf_flush_page_cleaner() write out pages until the buf_flush_async_lsn target has been reached. On a log checkpoint, if the new checkpoint LSN is not older than log_sys.resize_lsn (the start LSN of the ib_logfile101), we can switch files and complete the log resizing. Else, we will attempt to switch files on the next checkpoint. Log resizing can be aborted by killing the connection that is executing the SET GLOBAL statement. If the ib_logfile101 wraps around to the beginning, we must advance the log_sys.resize_lsn. In the resized log file, the sequence bit will always be written as 1 (no wrap-around). The log will be duplicated in log_t::resize_write(), invoked by mtr_t::finish_write(). When the log is being written via system calls (not PMEM), the initial log_sys.resize_lsn is the current log_sys.first_lsn, plus an integer multiple of log_sys.block_size, corresponding to the LSN at the start of the block that was written by log_sys.write_lsn. The log_sys.resize_buf will be of the same size as the log_sys.buf. During resizing, the contents of log_sys.buf and log_sys.resize_buf will be identical, except that the sequence bit of each mini-transaction will always be 1 in log_sys.resize_buf. 
If resizing is in progress, log_t::write_buf() will write log_sys.resize_buf to log_sys.resize_log (ib_logfile101). If the file would wrap around, the buffer will be written to log_sys.START_OFFSET and the log_sys.resize_lsn advanced accordingly. When using mmap() on /dev/shm or a PMEM mount -o dax file system, the initial log_sys.resize_lsn will be the log_sys.lsn at the time the resizing is initiated. If the log file wraps around during resizing, then the log_sys.resize_lsn will be advanced by (log_sys.resize_target - log_sys.START_OFFSET). log_t::resize_start(), log_t::resize_abort(), log_t::write_checkpoint(): Unless the log is mmap() based, acquire flush_lock and write_lock. In any case, acquire exclusive log_sys.latch to prevent race conditions. log_t::resize_rename(): Renamed from log_t::rename_resized(), and moved some code to the previous sole caller srv_start(). Thanks to Vladislav Vaintroub for helpful review comments and to Matthias Leich for testing this, in particular, testing crash recovery, multiple concurrent SET GLOBAL innodb_log_file_size and frequently killed connections.	2022-03-02 16:53:04 +02:00
Marko Mäkelä	66dd272572	MDEV-27910: Internal compiler error on CentOS 7 ARMv8 GCC 4.8.5 The build started to fail since `a635c40648`	2022-02-21 19:31:05 +02:00
Marko Mäkelä	1c5b099a96	MDEV-27876: SUX_LOCK_GENERIC build fails after MDEV-27774 The rw_lock_t wrapper does not define any is_locked() or is_write_locked() predicate. Therefore, we must add #ifndef SUX_LOCK_GENERIC before each debug assertion that asserts that log_sys.latch is being held by some thread (as an approximation for asserting that it is being held by the current thread).	2022-02-17 20:06:33 +02:00
Marko Mäkelä	f80deb9590	MDEV-27868 buf_pool.flush_list is in the wrong order buf_pool_t::insert_into_flush_list(): Remove any clean blocks that the buf_pool.flush_list may contain ever since commit `22b62edaed` (MDEV-25113). This fixes up commit `a635c40648` (MDEV-27774).	2022-02-17 19:38:17 +02:00
Marko Mäkelä	a635c40648	MDEV-27774 Reduce scalability bottlenecks in mtr_t::commit() A prominent bottleneck in mtr_t::commit() is log_sys.mutex between log_sys.append_prepare() and log_close(). User-visible change: The minimum innodb_log_file_size will be increased from 1MiB to 4MiB so that some conditions can be trivially satisfied. log_sys.latch (log_latch): Replaces log_sys.mutex and log_sys.flush_order_mutex. Copying mtr_t::m_log to log_sys.buf is protected by a shared log_sys.latch. Writes from log_sys.buf to the file system will be protected by an exclusive log_sys.latch. log_sys.lsn_lock: Protects the allocation of log buffer in log_sys.append_prepare(). sspin_lock: A simple spin lock, for log_sys.lsn_lock. Thanks to Vladislav Vaintroub for suggesting this idea, and for reviewing these changes. mariadb-backup: Replace some use of log_sys.mutex with recv_sys.mutex. buf_pool_t::insert_into_flush_list(): Implement sorting of flush_list because ordering is otherwise no longer guaranteed. Ordering by LSN is needed for the proper operation of redo log checkpoints. log_sys.append_prepare(): Advance log_sys.lsn and log_sys.buf_free by the length, and return the old values. Also increment write_to_buf, which was previously done in log_close(). mtr_t::finish_write(): Obtain the buffer pointer from log_sys.append_prepare(). log_sys.buf_free: Make the field Atomic_relaxed, to simplify log_flush_margin(). Use only loads and stores to avoid costly read-modify-write atomic operations. buf_pool.flush_list_requests: Replaces export_vars.innodb_buffer_pool_write_requests and srv_stats.buf_pool_write_requests. Protected by buf_pool.flush_list_mutex. buf_pool_t::insert_into_flush_list(): Do not invoke page_cleaner_wakeup(). Let the caller do that after a batch of calls. recv_recover_page(): Invoke a minimal part of buf_pool.insert_into_flush_list(). ReleaseBlocks::modified: A number of pages added to buf_pool.flush_list. 
ReleaseBlocks::operator(): Merge buf_flush_note_modification() here. log_t::set_capacity(): Renamed from log_set_capacity().	2022-02-10 16:37:12 +02:00
Oleksandr Byelkin	4fb2cb1a30	Merge branch '10.7' into 10.8	2022-02-04 14:50:25 +01:00
Oleksandr Byelkin	9ed8deb656	Merge branch '10.6' into 10.7	2022-02-04 14:11:46 +01:00
Marko Mäkelä	82f5981e72	MDEV-27058 fixup: Crash in innodb.leaf_page_corrupted_during_recovery buf_page_get_low(): If the page was read-fixed, validate the page ID because the page could have been marked as corrupted. We should retry the page read in this case, instead of returning a soon-to-be-evicted corrupted page to the caller. This was initially only observed on Microsoft Windows. On Linux, this was repeated after adding a sleep to buf_pool_t::corrupted_evict() between bpage->zip.fix.fetch_sub() and bpage->lock.x_unlock().	2022-02-03 17:02:27 +01:00
Oleksandr Byelkin	f5c5f8e41e	Merge branch '10.5' into 10.6	2022-02-03 17:01:31 +01:00
Oleksandr Byelkin	cf63eecef4	Merge branch '10.4' into 10.5	2022-02-01 20:33:04 +01:00
Oleksandr Byelkin	a576a1cea5	Merge branch '10.3' into 10.4	2022-01-30 09:46:52 +01:00
Oleksandr Byelkin	41a163ac5c	Merge branch '10.2' into 10.3	2022-01-29 15:41:05 +01:00
Haidong Ji	d0ca235d16	MDEV-27314 InnoDB Buffer Pool Resize output cleanup Cleaned up the log messages as suggested, with a minor code formatting change. On bullet point 13, I decided to not include timestamp in output message. In most (all?) cases, the output goes to the log file, which has timestamp already.	2022-01-24 11:14:26 +11:00
Marko Mäkelä	88d9fbb484	Disable adaptive spinning on buf_pool.mutex During the testing of MDEV-14425, buf_pool.mutex and log_sys.mutex were identified as the main bottlenecks for write workloads. Let us disable spinning also for buf_pool.mutex, except on ARMv8 where spinning was enabled for log_sys.mutex in commit `f7684f0ca5` (MDEV-26855). This was tested on AMD64 and recommended by Axel Schwenke. According to Krunal Bauskar, removing the spinloops did not improve performance in his tests on ARMv8.	2022-01-21 16:13:28 +02:00
Marko Mäkelä	5d54fd611f	Cleanup: Replace ut_crc32c(x,y) with my_crc32c(0,x,y)	2022-01-21 16:13:04 +02:00
Marko Mäkelä	685d958e38	MDEV-14425 Improve the redo log for concurrency The InnoDB redo log used to be formatted in blocks of 512 bytes. The log blocks were encrypted and the checksum was calculated while holding log_sys.mutex, creating a serious scalability bottleneck. We remove the fixed-size redo log block structure altogether and essentially turn every mini-transaction into a log block of its own. This allows encryption and checksum calculations to be performed on local mtr_t::m_log buffers, before acquiring log_sys.mutex. The mutex only protects a memcpy() of the data to the shared log_sys.buf, as well as the padding of the log, in case the to-be-written part of the log would not end in a block boundary of the underlying storage. For now, the "padding" consists of writing a single NUL byte, to allow recovery and mariadb-backup to detect the end of the circular log faster. Like the previous implementation, we will overwrite the last log block over and over again, until it has been completely filled. It would be possible to write only up to the last completed block (if no more recent write was requested), or to write dummy FILE_CHECKPOINT records to fill the incomplete block, by invoking the currently disabled function log_pad(). This would require adjustments to some logic around log checkpoints, page flushing, and shutdown. An upgrade after a crash of any previous version is not supported. Logically empty log files from a previous version will be upgraded. An attempt to start up InnoDB without a valid ib_logfile0 will be refused. Previously, the redo log used to be created automatically if it was missing. Only with innodb_force_recovery=6, it is possible to start InnoDB in read-only mode even if the log file does not exist. This allows the contents of a possibly corrupted database to be dumped. 
Because a prepared backup from an earlier version of mariadb-backup will create a 0-sized log file, we will allow an upgrade from such log files, provided that the FIL_PAGE_FILE_FLUSH_LSN in the system tablespace looks valid. The 512-byte log checkpoint blocks at 0x200 and 0x600 will be replaced with 64-byte log checkpoint blocks at 0x1000 and 0x2000. The start of log records will move from 0x800 to 0x3000. This allows us to use 4096-byte aligned blocks for all I/O in a future revision. We extend the MDEV-12353 redo log record format as follows. (1) Empty mini-transactions or extra NUL bytes will not be allowed. (2) The end-of-minitransaction marker (a NUL byte) will be replaced with a 1-bit sequence number, which will be toggled each time when the circular log file wraps back to the beginning. (3) After the sequence bit, a CRC-32C checksum of all data (excluding the sequence bit) will be written. (4) If the log is encrypted, 8 bytes will be written before the checksum and included in it. This is part of the initialization vector (IV) of encrypted log data. (5) File names, page numbers, and checkpoint information will not be encrypted. Only the payload bytes of page-level log will be encrypted. The tablespace ID and page number will form part of the IV. (6) For padding, arbitrary-length FILE_CHECKPOINT records may be written, with all-zero payload, and with the normal end marker and checksum. The minimum size is 7 bytes, or 7+8 with innodb_encrypt_log=ON. In mariadb-backup and in Galera snapshot transfer (SST) scripts, we will no longer remove ib_logfile0 or create an empty ib_logfile0. Server startup will require a valid log file. When resizing the log, we will create a logically empty ib_logfile101 at the current LSN and use an atomic rename to replace ib_logfile0 with it. See the test innodb.log_file_size. Because there is no mandatory padding in the log file, we are able to create a dummy log file as of an arbitrary log sequence number. 
See the test mariabackup.huge_lsn. The parameter innodb_log_write_ahead_size and the INFORMATION_SCHEMA.INNODB_METRICS counter log_padded will be removed. The minimum value of innodb_log_buffer_size will be increased to 2MiB (because log_sys.buf will replace recv_sys.buf) and the increment adjusted to 4096 bytes (the maximum log block size). The following INFORMATION_SCHEMA.INNODB_METRICS counters will be removed: os_log_fsyncs os_log_pending_fsyncs log_pending_log_flushes log_pending_checkpoint_writes The following status variables will be removed: Innodb_os_log_fsyncs (this is included in Innodb_data_fsyncs) Innodb_os_log_pending_fsyncs (this was limited to at most 1 by design) log_sys.get_block_size(): Return the physical block size of the log file. This is only implemented on Linux and Microsoft Windows for now, and for the power-of-2 block sizes between 64 and 4096 bytes (the minimum and maximum size of a checkpoint block). If the block size is anything else, the traditional 512-byte size will be used via normal file system buffering. If the file system buffers can be bypassed, a message like the following will be issued: InnoDB: File system buffers for log disabled (block size=512 bytes) InnoDB: File system buffers for log disabled (block size=4096 bytes) This has been tested on Linux and Microsoft Windows with both sizes. On Linux, only enable O_DIRECT on the log for innodb_flush_method=O_DSYNC. Tests in 3 different environments where the log is stored in a device with a physical block size of 512 bytes are yielding better throughput without O_DIRECT. This could be due to the fact that in the event the last log block is being overwritten (if multiple transactions would become durable at the same time, and each of them will write a small number of bytes to the last log block), it should be faster to re-copy data from log_sys.buf or log_sys.flush_buf to the kernel buffer, to be finally written at fdatasync() time. 
The parameter innodb_flush_method=O_DSYNC will imply O_DIRECT for data files. This option will enable O_DIRECT on the log file on Linux. It may be unsafe to use when the storage device does not support FUA (Force Unit Access) mode. When the server is compiled WITH_PMEM=ON, we will use memory-mapped I/O for the log file if the log resides on a "mount -o dax" device. We will identify PMEM in a start-up message: InnoDB: log sequence number 0 (memory-mapped); transaction id 3 On Linux, we will also invoke mmap() on any ib_logfile0 that resides in /dev/shm, effectively treating the log file as persistent memory. This should speed up "./mtr --mem" and increase the test coverage of PMEM on non-PMEM hardware. It also allows users to estimate how much the performance would be improved by installing persistent memory. On other tmpfs file systems such as /run, we will not use mmap(). mariadb-backup: Eliminated several variables. We will refer directly to recv_sys and log_sys. backup_wait_for_lsn(): Detect non-progress of xtrabackup_copy_logfile(). In this new log format with arbitrary-sized blocks, we can only detect log file overrun indirectly, by observing that the scanned log sequence number is not advancing. xtrabackup_copy_logfile(): On PMEM, do not modify the sequence bit, because we are not allowed to modify the server's log file, and our memory mapping is read-only. trx_flush_log_if_needed_low(): Do not use the callback on pmem. Using neither flush_lock nor write_lock around PMEM writes seems to yield the best performance. The pmem_persist() calls may still be somewhat slower than the pwrite() and fdatasync() based interface (PMEM mounted without -o dax). recv_sys_t::buf: Remove. We will use log_sys.buf for parsing. recv_sys_t::MTR_SIZE_MAX: Replaces RECV_SCAN_SIZE. recv_sys_t::file_checkpoint: Renamed from mlog_checkpoint_lsn. recv_sys_t, log_sys_t: Removed many data members. recv_sys.lsn: Renamed from recv_sys.recovered_lsn. 
recv_sys.offset: Renamed from recv_sys.recovered_offset. log_sys.buf_size: Replaces srv_log_buffer_size. recv_buf: A smart pointer that wraps log_sys.buf[recv_sys.offset] when the buffer is being allocated from the memory heap. recv_ring: A smart pointer that wraps a circular log_sys.buf[] that is backed by ib_logfile0. The pointer will wrap from recv_sys.len (log_sys.file_size) to log_sys.START_OFFSET. For the record that wraps around, we may copy file name or record payload data to the auxiliary buffer decrypt_buf in order to have a contiguous block of memory. The maximum size of a record is less than innodb_page_size bytes. recv_sys_t::parse(): Take the smart pointer as a template parameter. Do not temporarily add a trailing NUL byte to FILE_ records, because we are not supposed to modify the memory-mapped log file. (It is attached in read-write mode already during recovery.) recv_sys_t::parse_mtr(): Wrapper for recv_sys_t::parse(). recv_sys_t::parse_pmem(): Like parse_mtr(), but if PREMATURE_EOF would be returned on PMEM, use recv_ring to wrap around the buffer to the start. mtr_t::finish_write(), log_close(): Do not enforce log_sys.max_buf_free on PMEM, because it has no meaning on the mmap-based log. log_sys.write_to_buf: Count writes to log_sys.buf. Replaces srv_stats.log_write_requests and export_vars.innodb_log_write_requests. Protected by log_sys.mutex. Updated consistently in log_close(). Previously, mtr_t::commit() conditionally updated the count, which was inconsistent. log_sys.write_to_log: Count swaps of log_sys.buf and log_sys.flush_buf, for writing to log_sys.log (the ib_logfile0). Replaces srv_stats.log_writes and export_vars.innodb_log_writes. Protected by log_sys.mutex. log_sys.waits: Count waits in append_prepare(). Replaces srv_stats.log_waits and export_vars.innodb_log_waits. recv_recover_page(): Do not unnecessarily acquire log_sys.flush_order_mutex. 
We are inserting the blocks in arbitrary order anyway, to be adjusted in recv_sys.apply(true). We will change the definition of flush_lock and write_lock to avoid potential false sharing. Depending on sizeof(log_sys) and CPU_LEVEL1_DCACHE_LINESIZE, the flush_lock and write_lock could share a cache line with each other or with the last data members of log_sys. Thanks to Matthias Leich for providing https://rr-project.org traces for various failures during the development, and to Thirunarayanan Balathandayuthapani for his help in debugging some of the recovery code. And thanks to the developers of the rr debugger for a tool without which extensive changes to InnoDB would be very challenging to get right. Thanks to Vladislav Vaintroub for useful feedback and to him, Axel Schwenke and Krunal Bauskar for testing the performance.	2022-01-21 16:03:47 +02:00
Thirunarayanan Balathandayuthapani	28e166d643	MDEV-26784 [Warning] InnoDB: Difficult to find free blocks in the buffer pool Problem: ======= InnoDB ran out of memory during recovery and it fails to flush the dirty LRU blocks. The reason is that the buffer pool can run out before the LRU list length reaches the BUF_LRU_OLD_MIN_LEN(256) threshold. Fix: ==== During recovery, InnoDB should write out and evict all dirty blocks.	2022-01-21 14:15:18 +05:30
Marko Mäkelä	a855d6d93a	Merge 10.7 into 10.8	2022-01-20 08:24:12 +02:00
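
Note on MDEV-27900 (commit d78173828e above): the message describes completing a partially served read or write by requesting the rest of the block synchronously, so that the caller ends up with either the full block or a concrete error. The sketch below only illustrates that general idea and is not the MariaDB implementation. The names read_fully, short_pread and demo are hypothetical, the loop is fully synchronous (there is no AIO here), and os.pread is only available on Unix-like systems.

```python
import os
import tempfile


def read_fully(fd, length, offset, pread=os.pread):
    """Read exactly `length` bytes at `offset`, retrying after short reads.

    Hypothetical helper for illustration only; not MariaDB code.
    The `pread` parameter exists so a short-reading reader can be injected.
    """
    chunks = []
    remaining = length
    while remaining > 0:
        chunk = pread(fd, remaining, offset)
        if not chunk:
            # End of file before the block was complete: report a concrete error
            # instead of silently returning a partial block.
            raise EOFError(f"short read: {length - remaining} of {length} bytes")
        chunks.append(chunk)
        offset += len(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def demo():
    """Simulate a reader that returns at most 3 bytes per call."""
    def short_pread(fd, n, off):
        return os.pread(fd, min(n, 3), off)

    with tempfile.TemporaryFile() as f:
        f.write(b"0123456789")
        f.flush()
        return read_fully(f.fileno(), 10, 0, pread=short_pread)
```

Calling demo() reassembles the ten bytes from several short reads. Asking read_fully for more bytes than the file holds raises EOFError, which mirrors the point in the commit message that reattempting the request leaves the caller with a concrete error to handle.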


1605 commits