mariadb/storage/innobase/buf/buf0buddy.cc


/*****************************************************************************
Copyright (c) 2006, 2016, Oracle and/or its affiliates. All Rights Reserved.
Copyright (c) 2018, 2021, MariaDB Corporation.
This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; version 2 of the License.
This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with
this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1335 USA
*****************************************************************************/
/**************************************************//**
@file buf/buf0buddy.cc
Binary buddy allocator for compressed pages
Created December 2006 by Marko Makela
*******************************************************/
#include "buf0buddy.h"
#include "buf0buf.h"
#include "buf0lru.h"
#include "buf0flu.h"
#include "page0zip.h"
#include "srv0start.h"
/** When freeing a block we attempt to coalesce it by looking at its buddy
and deciding whether the buddy is free or not. To ascertain whether the
buddy is free we look for BUF_BUDDY_STAMP_FREE at BUF_BUDDY_STAMP_OFFSET
within the buddy. The question is how we can be sure that it is
safe to look at BUF_BUDDY_STAMP_OFFSET.
The answer lies in the following invariants:
* All blocks allocated by the buddy allocator are used for compressed
page frames.
* A compressed table always has space_id < SRV_SPACE_ID_UPPER_BOUND.
* BUF_BUDDY_STAMP_OFFSET always points to the space_id field in
a frame.
-- The above is true because we look at these fields only when the
corresponding buddy block is free, which implies that:
* The block we are looking at must have an address aligned at
the same size as its free buddy. For example, if we have
a free block of 8K, then its buddy's address must be aligned at
8K as well.
* The block we are looking at may have been further divided into
smaller blocks, but its starting address must still be the start
of a page frame, i.e. it cannot be the middle of a block. For
example, if we have a free block of size 8K, then its buddy may be
divided into blocks of, say, 1K, 1K, 2K, 4K, but the buddy's
address will still be the starting address of the first 1K
compressed page.
* Most importantly, for any given block, the buddy's address cannot
be in the middle of a larger block, i.e. in the above example, our
8K block cannot have a buddy whose address is aligned on 8K but is
part of a larger 16K block.
*/
/** Offset within buf_buddy_free_t where free or non_free stamps
are written.*/
#define BUF_BUDDY_STAMP_OFFSET FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID
/** Value that we stamp on all buffers that are currently on the zip_free
list. This value is stamped at BUF_BUDDY_STAMP_OFFSET offset */
#define BUF_BUDDY_STAMP_FREE SRV_SPACE_ID_UPPER_BOUND
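/* Any valid tablespace has space_id < SRV_SPACE_ID_UPPER_BOUND
(see the invariants above), so this stamp can never collide with
the space_id field of an allocated compressed page. */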
/** Stamp value for non-free buffers. Will be overwritten by a non-zero
value by the consumer of the block */
#define BUF_BUDDY_STAMP_NONFREE 0xFFFFFFFFUL
/** Return type of buf_buddy_is_free() */
enum buf_buddy_state_t {
BUF_BUDDY_STATE_FREE, /*!< The buddy is completely free */
BUF_BUDDY_STATE_USED, /*!< The buddy is currently in use */
BUF_BUDDY_STATE_PARTIALLY_USED/*!< Some sub-blocks in the buddy
are in use */
};
/**********************************************************************//**
Invalidate memory area that we won't access while page is free */
UNIV_INLINE
void
buf_buddy_mem_invalid(
/*==================*/
buf_buddy_free_t* buf, /*!< in: block to check */
ulint i) /*!< in: index of zip_free[] */
{
ut_ad(i <= BUF_BUDDY_SIZES);
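	/* Check that the whole block is addressable, then mark its
	contents as undefined, so that MemorySanitizer or Valgrind
	will flag any stray read of the freed memory. */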
MEM_CHECK_ADDRESSABLE(buf, BUF_BUDDY_LOW << i);
MEM_UNDEFINED(buf, BUF_BUDDY_LOW << i);
}
/**********************************************************************//**
Check if a buddy is stamped free.
@return whether the buddy is free */
UNIV_INLINE MY_ATTRIBUTE((warn_unused_result))
bool
buf_buddy_stamp_is_free(
/*====================*/
const buf_buddy_free_t* buf) /*!< in: block to check */
{
compile_time_assert(BUF_BUDDY_STAMP_FREE < BUF_BUDDY_STAMP_NONFREE);
return(mach_read_from_4(buf->stamp.bytes + BUF_BUDDY_STAMP_OFFSET)
== BUF_BUDDY_STAMP_FREE);
}
/**********************************************************************//**
Stamps a buddy free. */
UNIV_INLINE
void
buf_buddy_stamp_free(
/*=================*/
buf_buddy_free_t* buf, /*!< in/out: block to stamp */
ulint i) /*!< in: block size */
{
ut_d(memset(&buf->stamp.bytes, int(i), BUF_BUDDY_LOW << i));
buf_buddy_mem_invalid(buf, i);
mach_write_to_4(buf->stamp.bytes + BUF_BUDDY_STAMP_OFFSET,
BUF_BUDDY_STAMP_FREE);
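	/* Record the zip_free index of this block, so that
	buf_buddy_is_free() can distinguish a fully free block from
	one that has been split into smaller blocks. */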
buf->stamp.size = i;
}
/**********************************************************************//**
Stamps a buddy nonfree.
@param[in,out] buf block to stamp
@param[in] i block size */
static inline void buf_buddy_stamp_nonfree(buf_buddy_free_t* buf, ulint i)
{
buf_buddy_mem_invalid(buf, i);
compile_time_assert(BUF_BUDDY_STAMP_NONFREE == 0xffffffffU);
memset(buf->stamp.bytes + BUF_BUDDY_STAMP_OFFSET, 0xff, 4);
}
/**********************************************************************//**
Get the buddy of a compressed page frame.
@return the buddy of the page */
UNIV_INLINE
void*
buf_buddy_get(
/*==========*/
byte* page, /*!< in: compressed page */
ulint size) /*!< in: page size in bytes */
{
ut_ad(ut_is_2pow(size));
ut_ad(size >= BUF_BUDDY_LOW);
ut_ad(BUF_BUDDY_LOW <= UNIV_ZIP_SIZE_MIN);
ut_ad(size < BUF_BUDDY_HIGH);
ut_ad(BUF_BUDDY_HIGH == srv_page_size);
ut_ad(!ut_align_offset(page, size));
if (((ulint) page) & size) {
return(page - size);
} else {
return(page + size);
}
}
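/* Example: the computation above simply toggles the `size' bit
of the page address. With size == 4K (0x1000), a block at offset
0x3000 of its page frame has that bit set, so its buddy starts at
0x2000; conversely, the buddy of the block at 0x2000 is 0x3000. */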
#ifdef UNIV_DEBUG
/** Validate a given zip_free list. */
struct CheckZipFree {
CheckZipFree(ulint i) : m_i(i) {}
void operator()(const buf_buddy_free_t* elem) const
{
ut_ad(buf_buddy_stamp_is_free(elem));
ut_ad(elem->stamp.size <= m_i);
}
const ulint m_i;
};
/** Validate a buddy list.
@param[in] i buddy size to validate */
static void buf_buddy_list_validate(ulint i)
{
ut_list_validate(buf_pool.zip_free[i], CheckZipFree(i));
}
/**********************************************************************//**
Debug function to validate that a buffer is indeed free i.e.: in the
zip_free[].
@param[in] buf block to check
@param[in] i index of buf_pool.zip_free[]
@return true if free */
static bool buf_buddy_check_free(const buf_buddy_free_t* buf, ulint i)
{
const ulint size = BUF_BUDDY_LOW << i;
mysql_mutex_assert_owner(&buf_pool.mutex);
ut_ad(!ut_align_offset(buf, size));
ut_ad(i >= buf_buddy_get_slot(UNIV_ZIP_SIZE_MIN));
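	/* Slots of zip_free[] below buf_buddy_get_slot(UNIV_ZIP_SIZE_MIN)
	are never used, because the smallest buddy allocation is a
	compressed page of UNIV_ZIP_SIZE_MIN bytes. */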
buf_buddy_free_t* itr;
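	/* Linearly scan the zip_free[i] list for this block. */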
for (itr = UT_LIST_GET_FIRST(buf_pool.zip_free[i]);
itr && itr != buf;
itr = UT_LIST_GET_NEXT(list, itr)) {
}
return(itr == buf);
}
#endif /* UNIV_DEBUG */
/**********************************************************************//**
Checks if a buf is free i.e.: in the zip_free[].
@retval BUF_BUDDY_STATE_FREE if fully free
@retval BUF_BUDDY_STATE_USED if currently in use
@retval BUF_BUDDY_STATE_PARTIALLY_USED if partially in use. */
static MY_ATTRIBUTE((warn_unused_result))
buf_buddy_state_t
buf_buddy_is_free(
/*==============*/
buf_buddy_free_t* buf, /*!< in: block to check */
ulint i) /*!< in: index of
buf_pool.zip_free[] */
{
#ifdef UNIV_DEBUG
const ulint size = BUF_BUDDY_LOW << i;
ut_ad(!ut_align_offset(buf, size));
ut_ad(i >= buf_buddy_get_slot(UNIV_ZIP_SIZE_MIN));
#endif /* UNIV_DEBUG */
/* We assume that all memory from buf_buddy_alloc()
is used for compressed page frames. */
/* We look inside the allocated objects returned by
buf_buddy_alloc() and assume that each block is a compressed
page that contains one of the following in space_id.
* BUF_BUDDY_STAMP_FREE if the block is in a zip_free list or
* BUF_BUDDY_STAMP_NONFREE if the block has been allocated but
not initialized yet or
* A valid space_id of a compressed tablespace
The call below attempts to read from free memory. The memory
is "owned" by the buddy allocator (and it has been allocated
from the buffer pool), so there is nothing wrong with this. */
if (!buf_buddy_stamp_is_free(buf)) {
return(BUF_BUDDY_STATE_USED);
}
/* A block may be free but a fragment of it may still be in use.
To guard against that, we write the free block size (in terms of
the zip_free index) at the start of the stamped block. Note that
we can safely rely on this value only if the block is free. */
ut_ad(buf->stamp.size <= i);
return(buf->stamp.size == i
? BUF_BUDDY_STATE_FREE
: BUF_BUDDY_STATE_PARTIALLY_USED);
}
/** Add a block to the head of the appropriate buddy free list.
@param[in,out] buf block to be freed
@param[in] i index of buf_pool.zip_free[] */
UNIV_INLINE
void
buf_buddy_add_to_free(buf_buddy_free_t* buf, ulint i)
{
mysql_mutex_assert_owner(&buf_pool.mutex);
ut_ad(buf_pool.zip_free[i].start != buf);
buf_buddy_stamp_free(buf, i);
UT_LIST_ADD_FIRST(buf_pool.zip_free[i], buf);
ut_d(buf_buddy_list_validate(i));
}
/** Remove a block from the appropriate buddy free list.
@param[in,out] buf block to be freed
@param[in] i index of buf_pool.zip_free[] */
UNIV_INLINE
void
buf_buddy_remove_from_free(buf_buddy_free_t* buf, ulint i)
{
2020-10-15 12:10:42 +03:00
mysql_mutex_assert_owner(&buf_pool.mutex);
ut_ad(buf_buddy_check_free(buf, i));
UT_LIST_REMOVE(buf_pool.zip_free[i], buf);
buf_buddy_stamp_nonfree(buf, i);
}
/** Try to allocate a block from buf_pool.zip_free[].
@param[in] i index of buf_pool.zip_free[]
@return allocated block, or NULL if no free block of a suitable size
was found in buf_pool.zip_free[] */
static buf_buddy_free_t* buf_buddy_alloc_zip(ulint i)
{
buf_buddy_free_t* buf;
mysql_mutex_assert_owner(&buf_pool.mutex);
ut_a(i < BUF_BUDDY_SIZES);
ut_a(i >= buf_buddy_get_slot(UNIV_ZIP_SIZE_MIN));
ut_d(buf_buddy_list_validate(i));
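	/* buf_pool.zip_free[] only holds blocks smaller than
	srv_page_size; whole page frames are allocated and freed
	separately (see buf_buddy_block_free() below). The smallest
	possible block is the minimum ROW_FORMAT=COMPRESSED page
	size, UNIV_ZIP_SIZE_MIN. */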
buf = UT_LIST_GET_FIRST(buf_pool.zip_free[i]);
if (buf_pool.is_shrinking()
&& UT_LIST_GET_LEN(buf_pool.withdraw)
< buf_pool.withdraw_target) {
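		/* The buffer pool is being shrunk: the free list may
		contain blocks that belong to the chunks being
		withdrawn. Skip them, so that they can be withdrawn
		rather than handed out again. */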
while (buf != NULL
&& buf_pool.will_be_withdrawn(
reinterpret_cast<byte*>(buf))) {
/* This block is due to be withdrawn; do not allocate it. */
buf = UT_LIST_GET_NEXT(list, buf);
}
}
if (buf) {
buf_buddy_remove_from_free(buf, i);
} else if (i + 1 < BUF_BUDDY_SIZES) {
/* Attempt to split. */
buf = buf_buddy_alloc_zip(i + 1);
if (buf) {
buf_buddy_free_t* buddy =
reinterpret_cast<buf_buddy_free_t*>(
reinterpret_cast<byte*>(buf)
+ (BUF_BUDDY_LOW << i));
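			/* buf came from zip_free[i + 1], so it is of
			size BUF_BUDDY_LOW << (i + 1). Keep its lower
			half for the caller, and release the upper
			half, which starts at offset
			BUF_BUDDY_LOW << i, as the free buddy. */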
ut_ad(!buf_pool.contains_zip(buddy));
buf_buddy_add_to_free(buddy, i);
}
}
if (buf) {
/* Trash the block contents, except the BUF_BUDDY_STAMP_NONFREE stamp. */
MEM_UNDEFINED(buf, BUF_BUDDY_STAMP_OFFSET);
MEM_UNDEFINED(BUF_BUDDY_STAMP_OFFSET + 4 + buf->stamp.bytes,
(BUF_BUDDY_LOW << i)
- (BUF_BUDDY_STAMP_OFFSET + 4));
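		/* Only the bytes around the stamp were poisoned; the
		stamp itself, written by buf_buddy_stamp_nonfree(),
		must still read as BUF_BUDDY_STAMP_NONFREE. */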
ut_ad(mach_read_from_4(buf->stamp.bytes
+ BUF_BUDDY_STAMP_OFFSET)
== BUF_BUDDY_STAMP_NONFREE);
}
return(buf);
}
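
/* Usage sketch (illustrative only, not part of the build): allocate
room for one 4096-byte ROW_FORMAT=COMPRESSED page. The caller must
hold buf_pool.mutex and must handle a NULL return, which means that
no sufficiently large sub-page block was free and a whole page frame
has to be split instead (the fallback that buf_buddy_alloc_low()
performs using buf_LRU_get_free_block()):

	mysql_mutex_assert_owner(&buf_pool.mutex);

	if (buf_buddy_free_t* b
	    = buf_buddy_alloc_zip(buf_buddy_get_slot(4096))) {
		... use the 4096-byte block at b ...
	} else {
		... carve the space out of a whole page frame ...
	}
*/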
/** Deallocate a buffer frame of srv_page_size.
@param[in] buf buffer frame to deallocate */
static
void
buf_buddy_block_free(void* buf)
{
const ulint fold = BUF_POOL_ZIP_FOLD_PTR(buf);
buf_page_t* bpage;
buf_block_t* block;
mysql_mutex_assert_owner(&buf_pool.mutex);
ut_a(!ut_align_offset(buf, srv_page_size));
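	/* Each srv_page_size frame that was handed over to the buddy
	allocator is registered in buf_pool.zip_hash, keyed by the
	frame address; look up the buf_page_t that owns this frame. */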
HASH_SEARCH(hash, &buf_pool.zip_hash, fold, buf_page_t*, bpage,
ut_ad(bpage->state() == buf_page_t::MEMORY
&& bpage->in_zip_hash),
bpage->frame == buf);
ut_a(bpage);
ut_a(bpage->state() == buf_page_t::MEMORY);
ut_ad(bpage->in_zip_hash);
ut_d(bpage->in_zip_hash = false);
HASH_DELETE(buf_page_t, hash, &buf_pool.zip_hash, fold, bpage);
bpage->hash = nullptr;
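/* Wipe the freed frame in debug builds, and mark it as undefined
memory so that Valgrind or MemorySanitizer will flag any further
reads of it. */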
ut_d(memset(buf, 0, srv_page_size));
MEM_UNDEFINED(buf, srv_page_size);
block = (buf_block_t*) bpage;
buf_LRU_block_free_non_file_page(block);
ut_ad(buf_pool.buddy_n_frames > 0);
ut_d(buf_pool.buddy_n_frames--);
}
/**********************************************************************//**
Allocate a buffer block to the buddy allocator and register it
in buf_pool.zip_hash. */
static
void
buf_buddy_block_register(
/*=====================*/
buf_block_t* block) /*!< in: buffer frame to allocate */
{
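/* Blocks owned by the buddy allocator are tracked in
buf_pool.zip_hash, so that buf_buddy_block_free() can map a frame
pointer back to its block descriptor. */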
const ulint fold = BUF_POOL_ZIP_FOLD(block);
ut_ad(block->page.state() == buf_page_t::MEMORY);
ut_a(block->page.frame);
ut_a(!ut_align_offset(block->page.frame, srv_page_size));
ut_ad(!block->page.in_zip_hash);
ut_d(block->page.in_zip_hash = true);
HASH_INSERT(buf_page_t, hash, &buf_pool.zip_hash, fold, &block->page);
ut_d(buf_pool.buddy_n_frames++);
}
/** Allocate a block from a bigger object.
@param[in] buf a block that is free to use
@param[in] i index of buf_pool.zip_free[]
@param[in] j size of buf as an index of buf_pool.zip_free[]
@return allocated block */
static
void*
buf_buddy_alloc_from(void* buf, ulint i, ulint j)
{
ulint offs = BUF_BUDDY_LOW << j;
ut_ad(j <= BUF_BUDDY_SIZES);
ut_ad(i >= buf_buddy_get_slot(UNIV_ZIP_SIZE_MIN));
ut_ad(j >= i);
ut_ad(!ut_align_offset(buf, offs));
/* Add the unused parts of the block to the free lists. */
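/* For example, assuming BUF_BUDDY_LOW is 1KiB and the page size is
16KiB: splitting a 16KiB block (j=4) for a 2KiB request (i=1) frees
the buddies at offsets 8KiB (list 3), 4KiB (list 2) and 2KiB (list 1),
and returns the lowest 2KiB. */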
while (j > i) {
buf_buddy_free_t* zip_buf;
offs >>= 1;
j--;
zip_buf = reinterpret_cast<buf_buddy_free_t*>(
reinterpret_cast<byte*>(buf) + offs);
buf_buddy_add_to_free(zip_buf, j);
}
buf_buddy_stamp_nonfree(reinterpret_cast<buf_buddy_free_t*>(buf), i);
return(buf);
}
/** Allocate a ROW_FORMAT=COMPRESSED block.
@param i index of buf_pool.zip_free[] or BUF_BUDDY_SIZES
@param lru set to true if buf_pool.mutex was temporarily released
@return allocated block, never NULL */
byte *buf_buddy_alloc_low(ulint i, bool *lru)
{
buf_block_t* block;
mysql_mutex_assert_owner(&buf_pool.mutex);
ut_ad(i >= buf_buddy_get_slot(UNIV_ZIP_SIZE_MIN));
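/* The allocation cascades through three sources: the buddy free
lists, then a whole frame from buf_pool.free, and finally the LRU
eviction path, which may temporarily release buf_pool.mutex and
set *lru. */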
if (i < BUF_BUDDY_SIZES) {
/* Try to allocate from the buddy system. */
block = (buf_block_t*) buf_buddy_alloc_zip(i);
if (block) {
goto func_exit;
}
}
/* Try allocating from the buf_pool.free list. */
block = buf_LRU_get_free_only();
if (block) {
goto alloc_big;
}
/* Try replacing an uncompressed page in the buffer pool. */
block = buf_LRU_get_free_block(true);
if (lru) {
*lru = true;
}
alloc_big:
buf_buddy_block_register(block);
block = reinterpret_cast<buf_block_t*>(
buf_buddy_alloc_from(block->page.frame, i, BUF_BUDDY_SIZES));
func_exit:
buf_pool.buddy_stat[i].used++;
return reinterpret_cast<byte*>(block);
}
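/* Illustrative sketch, not part of the build: a caller is assumed to
hold buf_pool.mutex, map the compressed page size to a zip_free[] slot
index, and invoke the allocator roughly as follows (the exact wrapper
and calling convention are assumptions, not taken from this excerpt):

	// bool lru = false;
	// ulint i = buf_buddy_get_slot(zip_size);
	// byte* buf = buf_buddy_alloc_low(i, &lru);

On return, lru would report whether a page had to be evicted from
buf_pool.LRU to replenish the free list. */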
/** Try to relocate a block. The caller must hold buf_pool.mutex.
@param[in]	src	block to relocate
@param[in]	dst	free block to relocate to
@param[in]	i	index of buf_pool.zip_free[]
@param[in]	force	true if we must always relocate
@return true if relocated */
static bool buf_buddy_relocate(void* src, void* dst, ulint i, bool force)
{
buf_page_t* bpage;
const ulint size = BUF_BUDDY_LOW << i;
mysql_mutex_assert_owner(&buf_pool.mutex);
ut_ad(!ut_align_offset(src, size));
ut_ad(!ut_align_offset(dst, size));
ut_ad(i >= buf_buddy_get_slot(UNIV_ZIP_SIZE_MIN));
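/* A note on the two alignment assertions above: blocks of size
(BUF_BUDDY_LOW << i) are carved out of an aligned frame, so a block and
its buddy differ only in the address bit corresponding to their size.
A buddy lookup is therefore assumed to reduce to something like:

	// size  = BUF_BUDDY_LOW << i;
	// buddy = (ulint(block) & size) ? block - size : block + size;

This is a sketch of what a buf_buddy_get()-style helper would compute
elsewhere in this file, not a definition of it. */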
MEM_CHECK_ADDRESSABLE(dst, size);
uint32_t space = mach_read_from_4(static_cast<const byte*>(src)
+ FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID);
uint32_t offset = mach_read_from_4(static_cast<const byte*>(src)
+ FIL_PAGE_OFFSET);
/* Suppress Valgrind or MSAN warnings. */
MEM_MAKE_DEFINED(&space, sizeof space);
MEM_MAKE_DEFINED(&offset, sizeof offset);
ut_ad(space != BUF_BUDDY_STAMP_FREE);
const page_id_t page_id(space, offset);
/* FIXME: we are computing this while holding buf_pool.mutex */
auto &cell= buf_pool.page_hash.cell_get(page_id.fold());
bpage = buf_pool.page_hash.get(page_id, cell);
if (!bpage || bpage->zip.data != src) {
/* The block has probably been freshly
allocated by buf_LRU_get_free_block() but not
added to buf_pool.page_hash yet. Obviously,
it cannot be relocated. */
if (!force || space != 0 || offset != 0) {
return(false);
}
/* It might be just an uninitialized page.
We should also search the LRU list. */
bpage = UT_LIST_GET_FIRST(buf_pool.LRU);
while (bpage != NULL) {
if (bpage->zip.data == src) {
ut_ad(bpage->id() == page_id);
break;
}
bpage = UT_LIST_GET_NEXT(LRU, bpage);
}
if (bpage == NULL) {
return(false);
}
}
if (page_zip_get_size(&bpage->zip) != size) {
/* The block is of different size. We would
have to relocate all blocks covered by src.
For the sake of simplicity, give up. */
ut_ad(page_zip_get_size(&bpage->zip) < size);
return(false);
}
/* The block must have been allocated, but it may
contain uninitialized data. */
MEM_CHECK_ADDRESSABLE(src, size);
if (!bpage->can_relocate()) {
return false;
}
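/* Note that can_relocate() is checked twice: here, cheaply, as an
early out before taking the page_hash latch, and again below under the
exclusive latch, because the block may become buffer-fixed or io-fixed
in between. */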
page_hash_latch &hash_lock = buf_pool.page_hash.lock_get(cell);
/* It does not make sense to use transactional_lock_guard here,
because the memcpy() of 1024 to 16384 bytes would likely make the
memory transaction too large. */
hash_lock.lock();
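/* While the exclusive page_hash latch is held, no other thread can
look up this page and dereference bpage->zip.data, so the pointer
switch and the memcpy() below appear atomic to readers. */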
if (bpage->can_relocate()) {
/* Relocate the compressed page. */
const ulonglong ns = my_interval_timer();
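/* my_interval_timer() returns a monotonic reading in nanoseconds;
the division by 1000 below converts the elapsed time to the
microseconds stored in relocated_usec. */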
ut_a(bpage->zip.data == src);
memcpy(dst, src, size);
bpage->zip.data = reinterpret_cast<page_zip_t*>(dst);
hash_lock.unlock();
buf_buddy_mem_invalid(
reinterpret_cast<buf_buddy_free_t*>(src), i);
buf_buddy_stat_t* buddy_stat = &buf_pool.buddy_stat[i];
buddy_stat->relocated++;
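	/* my_interval_timer() returns nanoseconds; the relocation
	statistics are kept in microseconds. */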
	buddy_stat->relocated_usec += (my_interval_timer() - ns) / 1000;
return(true);
}
hash_lock.unlock();
return(false);
}
/** Deallocate a block.
@param[in] buf block to be freed, must not be pointed to
by the buffer pool
@param[in] i index of buf_pool.zip_free[], or BUF_BUDDY_SIZES */
void buf_buddy_free_low(void* buf, ulint i)
{
buf_buddy_free_t* buddy;
mysql_mutex_assert_owner(&buf_pool.mutex);
ut_ad(i <= BUF_BUDDY_SIZES);
ut_ad(i >= buf_buddy_get_slot(UNIV_ZIP_SIZE_MIN));
ut_ad(buf_pool.buddy_stat[i].used > 0);
buf_pool.buddy_stat[i].used--;
recombine:
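	/* The block contents are garbage from now on; tell
	MSAN/Valgrind that the memory is undefined. */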
MEM_UNDEFINED(buf, BUF_BUDDY_LOW << i);
if (i == BUF_BUDDY_SIZES) {
buf_buddy_block_free(buf);
return;
}
ut_ad(i < BUF_BUDDY_SIZES);
ut_ad(buf == ut_align_down(buf, BUF_BUDDY_LOW << i));
ut_ad(!buf_pool.contains_zip(buf));
	/* Do not recombine blocks if there are few free blocks.
	With up to max_len = 16 blocks kept on each zip_free[] list,
	we may waste up to 15360 * max_len bytes in free blocks
	(1024 + 2048 + 4096 + 8192 = 15360). */
if (UT_LIST_GET_LEN(buf_pool.zip_free[i]) < 16
&& !buf_pool.is_shrinking()) {
goto func_exit;
}
/* Try to combine adjacent blocks. */
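	/* The buddy of a block of size (BUF_BUDDY_LOW << i) is the
	adjacent block of the same size whose address differs from
	buf only in the bit corresponding to the block size. For
	example, for 4096-byte blocks, the buddy of the block at
	frame offset 0x2000 is at 0x3000, and vice versa. */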
buddy = reinterpret_cast<buf_buddy_free_t*>(
buf_buddy_get(reinterpret_cast<byte*>(buf),
BUF_BUDDY_LOW << i));
switch (buf_buddy_is_free(buddy, i)) {
case BUF_BUDDY_STATE_FREE:
/* The buddy is free: recombine */
buf_buddy_remove_from_free(buddy, i);
buddy_is_free:
ut_ad(!buf_pool.contains_zip(buddy));
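		/* Merge buf with its free buddy into a block of
		twice the size, which starts at the address that
		is aligned to the larger size. */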
i++;
buf = ut_align_down(buf, BUF_BUDDY_LOW << i);
goto recombine;
case BUF_BUDDY_STATE_USED:
ut_d(buf_buddy_list_validate(i));
/* The buddy is not free. Is there a free block of
this size? */
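		/* If the in-use buddy can be relocated into another
		free block of the same size, the buddy becomes free
		and the recombination can proceed. */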
if (buf_buddy_free_t* zip_buf =
UT_LIST_GET_FIRST(buf_pool.zip_free[i])) {
/* Remove the block from the free list, because
a successful buf_buddy_relocate() will overwrite
zip_free->list. */
buf_buddy_remove_from_free(zip_buf, i);
/* Try to relocate the buddy of buf to the free
block. */
if (buf_buddy_relocate(buddy, zip_buf, i, false)) {
goto buddy_is_free;
}
buf_buddy_add_to_free(zip_buf, i);
}
break;
case BUF_BUDDY_STATE_PARTIALLY_USED:
/* Some sub-blocks in the buddy are still in use.
Relocation will fail. No need to try. */
break;
}
func_exit:
/* Free the block to the buddy list. */
buf_buddy_add_to_free(reinterpret_cast<buf_buddy_free_t*>(buf), i);
}
/** Try to reallocate a block.
@param[in] buf buf_pool block to be reallocated
@param[in] size block size, up to srv_page_size
@return whether the reallocation succeeded */
bool
buf_buddy_realloc(void* buf, ulint size)
{
buf_block_t* block = NULL;
ulint i = buf_buddy_get_slot(size);
mysql_mutex_assert_owner(&buf_pool.mutex);
ut_ad(i <= BUF_BUDDY_SIZES);
ut_ad(i >= buf_buddy_get_slot(UNIV_ZIP_SIZE_MIN));
if (i < BUF_BUDDY_SIZES) {
/* Try to allocate from the buddy system. */
block = reinterpret_cast<buf_block_t*>(buf_buddy_alloc_zip(i));
}
if (block == NULL) {
/* Try allocating from the buf_pool.free list. */
block = buf_LRU_get_free_only();
if (block == NULL) {
return(false); /* free_list was not enough */
}
buf_buddy_block_register(block);
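		/* Split the newly registered page frame: the first
		(BUF_BUDDY_LOW << i) bytes become the allocation, and
		buf_buddy_alloc_from() is expected to put the
		remainder on the zip_free[] lists. */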
block = reinterpret_cast<buf_block_t*>(
buf_buddy_alloc_from(
block->page.frame, i, BUF_BUDDY_SIZES));
}
buf_pool.buddy_stat[i].used++;
	/* Try to relocate the block at buf to the free block. */
if (buf_buddy_relocate(buf, block, i, true)) {
		/* succeeded: free the old block */
buf_buddy_free_low(buf, i);
} else {
		/* failed: free the newly allocated block */
buf_buddy_free_low(block, i);
}
return(true); /* free_list was enough */
}
/** Combine all pairs of free buddies. */
void buf_buddy_condense_free()
{
2020-10-15 12:10:42 +03:00
mysql_mutex_assert_owner(&buf_pool.mutex);
ut_ad(buf_pool.is_shrinking());
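
	/* Walk every buf_pool.zip_free[] list and coalesce pairs of
	free buddies, so that the blocks residing in the memory area
	that is being withdrawn can grow into larger free blocks. */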
for (ulint i = 0; i < UT_ARR_SIZE(buf_pool.zip_free); ++i) {
buf_buddy_free_t* buf =
UT_LIST_GET_FIRST(buf_pool.zip_free[i]);
		/* Seek to the first free block that will be withdrawn
		by the buffer pool shrink. */
while (buf != NULL
&& !buf_pool.will_be_withdrawn(
reinterpret_cast<byte*>(buf))) {
buf = UT_LIST_GET_NEXT(list, buf);
}
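
		/* Coalesce each remaining withdraw target with its
		buddy whenever both of them are free. */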
while (buf != NULL) {
buf_buddy_free_t* next =
UT_LIST_GET_NEXT(list, buf);
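
			/* The buddy of a block of size (BUF_BUDDY_LOW << i)
			lives at the address obtained by toggling that size
			bit: for example, assuming 4KiB blocks, a block at
			offset 0x1000 of its page frame has its buddy at
			offset 0x0000, and one at 0x2000 has its buddy at
			0x3000. */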
buf_buddy_free_t* buddy =
reinterpret_cast<buf_buddy_free_t*>(
buf_buddy_get(
reinterpret_cast<byte*>(buf),
BUF_BUDDY_LOW << i));
			/* Seek to the next withdraw target, skipping over
			the buddy of buf: if both are free, they will be
			coalesced below, so the buddy must not become the
			next starting point. */
while (true) {
while (next != NULL
&& !buf_pool.will_be_withdrawn(
reinterpret_cast<byte*>(next))) {
next = UT_LIST_GET_NEXT(list, next);
}
if (buddy != next) {
break;
}
next = UT_LIST_GET_NEXT(list, next);
}
if (buf_buddy_is_free(buddy, i)
== BUF_BUDDY_STATE_FREE) {
/* Both buf and buddy are free.
Try to combine them. */
buf_buddy_remove_from_free(buf, i);
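			/* buf_buddy_free_low() will decrement
			buddy_stat[i].used; increment it here to
			compensate, because this block was taken from
			the free list rather than from a user. */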
buf_pool.buddy_stat[i].used++;
buf_buddy_free_low(buf, i);
}
buf = next;
}
}
}