mariadb/storage/innobase/include/btr0sea.h
Marko Mäkelä b6923420f3 MDEV-29445: Reimplement SET GLOBAL innodb_buffer_pool_size
We deprecate and ignore the parameter innodb_buffer_pool_chunk_size
and let the buffer pool size to be changed in arbitrary 1-megabyte
increments.

innodb_buffer_pool_size_max: A new read-only startup parameter
that specifies the maximum innodb_buffer_pool_size.  If 0 or
unspecified, it will default to the specified innodb_buffer_pool_size
rounded up to the allocation unit (2 MiB or 8 MiB).  The maximum value
is 4GiB-2MiB on 32-bit systems and 16EiB-8MiB on 64-bit systems.
This maximum is very likely to be limited further by the operating system.

The status variable Innodb_buffer_pool_resize_status will reflect
the status of shrinking the buffer pool. When no shrinking is in
progress, the string will be empty.

Unlike before, the execution of SET GLOBAL innodb_buffer_pool_size
will block until the requested buffer pool size change has been
implemented, or the execution is interrupted by a KILL statement
a client disconnect, or server shutdown.  If the
buf_flush_page_cleaner() thread notices that we are running out of
memory, the operation may fail with ER_WRONG_USAGE.

SET GLOBAL innodb_buffer_pool_size will be refused
if the server was started with --large-pages (even if
no HugeTLB pages were successfully allocated). This functionality
is somewhat exercised by the test main.large_pages, which now runs
also on Microsoft Windows.  On Linux, explicit HugeTLB mappings are
apparently excluded from the reported Redident Set Size (RSS), and
apparently unshrinkable between mmap(2) and munmap(2).

The buffer pool will be mapped to a contiguous virtual memory area
that will be aligned and partitioned into extents of 8 MiB on
64-bit systems and 2 MiB on 32-bit systems.

Within an extent, the first few innodb_page_size blocks contain
buf_block_t objects that will cover the page frames in the rest
of the extent.  The number of such frames is precomputed in the
array first_page_in_extent[] for each innodb_page_size.
In this way, there is a trivial mapping between
page frames and block descriptors and we do not need any
lookup tables like buf_pool.zip_hash or buf_pool_t::chunk_t::map.

We will always allocate the same number of block descriptors for
an extent, even if we do not need all the buf_block_t in the last
extent in case the innodb_buffer_pool_size is not an integer multiple
of the of extents size.

The minimum innodb_buffer_pool_size is 256*5/4 pages.  At the default
innodb_page_size=16k this corresponds to 5 MiB.  However, now that the
innodb_buffer_pool_size includes the memory allocated for the block
descriptors, the minimum would be innodb_buffer_pool_size=6m.

my_large_virtual_alloc(): A new function, similar to my_large_malloc().

my_virtual_mem_reserve(), my_virtual_mem_commit(),
my_virtual_mem_decommit(), my_virtual_mem_release():
New interface mostly by Vladislav Vaintroub, to separately
reserve and release virtual address space, as well as to
commit and decommit memory within it.

After my_virtual_mem_decommit(), the virtual memory range will be
read-only or unaccessible, depending on whether the build option
cmake -DHAVE_UNACCESSIBLE_AFTER_MEM_DECOMMIT=1
has been specified.  This option is hard-coded on Microsoft Windows,
where VirtualMemory(MEM_DECOMMIT) will make the memory unaccessible.
On IBM AIX, Linux, Illumos and possibly Apple macOS, the virtual memory
will be zeroed out immediately.  On other POSIX-like systems,
madvise(MADV_FREE) will be used if available, to give the operating
system kernel a permission to zero out the virtual memory range.
We prefer immediate freeing so that the reported
resident set size (RSS) of the process will reflect the current
innodb_buffer_pool_size.  Shrinking the buffer pool is a rarely
executed resource intensive operation, and the immediate configuration
of the MMU mappings should not incur significant additional penalty.

opt_super_large_pages: Declare only on Solaris. Actually, this is
specific to the SPARC implementation of Solaris, but because we
lack access to a Solaris development environment, we will not revise
this for other MMU and ISA.

buf_pool_t::chunk_t::create(): Remove.

buf_pool_t::create(): Initialize all n_blocks of the buf_pool.free list.

buf_pool_t::allocate(): Renamed from buf_LRU_get_free_only().

buf_pool_t::LRU_warned: Changed to Atomic_relaxed<bool>,
only to be modified by the buf_flush_page_cleaner() thread.

buf_pool_t::shrink(): Attempt to shrink the buffer pool.
There are 3 possible outcomes: SHRINK_DONE (success),
SHRINK_IN_PROGRESS (the caller may keep trying),
and SHRINK_ABORT (we seem to be running out of buffer pool).
While traversing buf_pool.LRU, release the contended
buf_pool.mutex once in every 32 iterations in order to
reduce starvation. Use lru_scan_itr for efficient traversal,
similar to buf_LRU_free_from_common_LRU_list().

buf_pool_t::shrunk(): Update the reduced size of the buffer pool
in a way that is compatible with buf_pool_t::page_guess(),
and invoke my_virtual_mem_decommit().

buf_pool_t::resize(): Before invoking shrink(), run one batch of
buf_flush_page_cleaner() in order to prevent LRU_warn().
Abort if shrink() recommends it, or no blocks were withdrawn in
the past 15 seconds, or the execution of the statement
SET GLOBAL innodb_buffer_pool_size was interrupted.

buf_pool_t::first_to_withdraw: The first block descriptor that is
out of the bounds of the shrunk buffer pool.

buf_pool_t::withdrawn: The list of withdrawn blocks.
If buf_pool_t::resize() is aborted before shrink() completes,
we must be able to resurrect the withdrawn blocks in the free list.

buf_pool_t::contains_zip(): Added a parameter for the
number of least significant pointer bits to disregard,
so that we can find any pointers to within a block
that is supposed to be free.

buf_pool_t::is_shrinking(): Return the total number or blocks that
were withdrawn or are to be withdrawn.

buf_pool_t::to_withdraw(): Return the number of blocks that will need to
be withdrawn.

buf_pool_t::usable_size(): Number of usable pages, considering possible
in-progress attempt at shrinking the buffer pool.

buf_pool_t::page_guess(): Try to buffer-fix a guessed block pointer.
If HAVE_UNACCESSIBLE_AFTER_MEM_DECOMMIT is set, the pointer will
be validated before being dereferenced.

buf_pool_t::get_info(): Replaces buf_stats_get_pool_info().

innodb_init_param(): Refactored. We must first compute
srv_page_size_shift and then determine the valid bounds of
innodb_buffer_pool_size.

buf_buddy_shrink(): Replaces buf_buddy_realloc().
Part of the work is deferred to buf_buddy_condense_free(),
which is being executed when we are not holding any
buf_pool.page_hash latch.

buf_buddy_condense_free(): Do not relocate blocks.

buf_buddy_free_low(): Do not care about buffer pool shrinking.
This will be handled by buf_buddy_shrink() and
buf_buddy_condense_free().

buf_buddy_alloc_zip(): Assert !buf_pool.contains_zip()
when we are allocating from the binary buddy system.
Previously we were asserting this on multiple recursion levels.

buf_buddy_block_free(), buf_buddy_free_low():
Assert !buf_pool.contains_zip().

buf_buddy_alloc_from(): Remove the redundant parameter j.

buf_flush_LRU_list_batch(): Add the parameter to_withdraw
to keep track of buf_pool.n_blocks_to_withdraw.

buf_do_LRU_batch(): Skip buf_free_from_unzip_LRU_list_batch()
if we are shrinking the buffer pool. In that case, we want
to minimize the page relocations and just finish as quickly
as possible.

trx_purge_attach_undo_recs(): Limit purge_sys.n_pages_handled()
in every iteration, in case the buffer pool is being shrunk
in the middle of a purge batch.

Reviewed by: Debarun Banerjee
2025-03-26 17:05:44 +02:00

407 lines
14 KiB
C++

/*****************************************************************************
Copyright (c) 1996, 2016, Oracle and/or its affiliates. All Rights Reserved.
Copyright (c) 2017, 2022, MariaDB Corporation.
This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; version 2 of the License.
This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with
this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1335 USA
*****************************************************************************/
/********************************************************************//**
@file include/btr0sea.h
The index tree adaptive search
Created 2/17/1996 Heikki Tuuri
*************************************************************************/
#ifndef btr0sea_h
#define btr0sea_h
#include "dict0dict.h"
#ifdef BTR_CUR_HASH_ADAPT
#include "ha0ha.h"
#include "srw_lock.h"
#ifdef UNIV_PFS_RWLOCK
extern mysql_pfs_key_t btr_search_latch_key;
#endif /* UNIV_PFS_RWLOCK */
#define btr_search_sys_create() btr_search_sys.create()
#define btr_search_sys_free() btr_search_sys.free()
/** Lazily free detached metadata when removing the last reference. */
ATTRIBUTE_COLD void btr_search_lazy_free(dict_index_t *index);
/** Disable the adaptive hash search system and empty the index.
@return whether the adaptive hash index was enabled */
ATTRIBUTE_COLD bool btr_search_disable();
/** Enable the adaptive hash search system.
@param resize whether buf_pool_t::resize() is the caller */
ATTRIBUTE_COLD void btr_search_enable(bool resize= false);
/*********************************************************************//**
Updates the search info. */
UNIV_INLINE
void
btr_search_info_update(
/*===================*/
dict_index_t* index, /*!< in: index of the cursor */
btr_cur_t* cursor);/*!< in: cursor which was just positioned */
/** Tries to guess the right search position based on the hash search info
of the index. Note that if mode is PAGE_CUR_LE, which is used in inserts,
and the function returns TRUE, then cursor->up_match and cursor->low_match
both have sensible values.
@param[in,out] index index
@param[in,out] info index search info
@param[in] tuple logical record
@param[in] mode PAGE_CUR_L, ....
@param[in] latch_mode BTR_SEARCH_LEAF, ...
@param[out] cursor tree cursor
@param[in] mtr mini-transaction
@return whether the search succeeded */
bool
btr_search_guess_on_hash(
dict_index_t* index,
btr_search_t* info,
const dtuple_t* tuple,
ulint mode,
ulint latch_mode,
btr_cur_t* cursor,
mtr_t* mtr);
/** Move or delete hash entries for moved records, usually in a page split.
If new_block is already hashed, then any hash index for block is dropped.
If new_block is not hashed, and block is hashed, then a new hash index is
built to new_block with the same parameters as block.
@param[in,out] new_block destination page
@param[in,out] block source page (subject to deletion later) */
void
btr_search_move_or_delete_hash_entries(
buf_block_t* new_block,
buf_block_t* block);
/** Drop any adaptive hash index entries that point to an index page.
@param[in,out] block block containing index page, s- or x-latched, or an
index page for which we know that
block->buf_fix_count == 0 or it is an index page which
has already been removed from the buf_pool.page_hash
i.e.: it is in state BUF_BLOCK_REMOVE_HASH
@param[in] garbage_collect drop ahi only if the index is marked
as freed */
void btr_search_drop_page_hash_index(buf_block_t* block,
bool garbage_collect);
/** Drop possible adaptive hash index entries when a page is evicted
from the buffer pool or freed in a file, or the index is being dropped.
@param[in] page_id page id */
void btr_search_drop_page_hash_when_freed(const page_id_t page_id);
/** Updates the page hash index when a single record is inserted on a page.
@param[in] cursor cursor which was positioned to the place to insert
using btr_cur_search_, and the new record has been
inserted next to the cursor.
@param[in] ahi_latch the adaptive hash index latch */
void btr_search_update_hash_node_on_insert(btr_cur_t *cursor,
srw_spin_lock *ahi_latch);
/** Updates the page hash index when a single record is inserted on a page.
@param[in,out] cursor cursor which was positioned to the
place to insert using btr_cur_search_...,
and the new record has been inserted next
to the cursor
@param[in] ahi_latch the adaptive hash index latch */
void btr_search_update_hash_on_insert(btr_cur_t *cursor,
srw_spin_lock *ahi_latch);
/** Updates the page hash index when a single record is deleted from a page.
@param[in] cursor cursor which was positioned on the record to delete
using btr_cur_search_, the record is not yet deleted.*/
void btr_search_update_hash_on_delete(btr_cur_t *cursor);
/** Validates the search system.
@param thd connection, for checking if CHECK TABLE has been killed
@return true if ok */
bool btr_search_validate(THD *thd);
/** Lock all search latches in exclusive mode. */
static inline void btr_search_x_lock_all();
/** Unlock all search latches from exclusive mode. */
static inline void btr_search_x_unlock_all();
/** Lock all search latches in shared mode. */
static inline void btr_search_s_lock_all();
/** Unlock all search latches from shared mode. */
static inline void btr_search_s_unlock_all();
# ifdef UNIV_DEBUG
/** @return if the index is marked as freed */
bool btr_search_check_marked_free_index(const buf_block_t *block);
# endif /* UNIV_DEBUG */
#else /* BTR_CUR_HASH_ADAPT */
# define btr_search_sys_create()
# define btr_search_sys_free()
# define btr_search_drop_page_hash_index(block, garbage_collect)
# define btr_search_s_lock_all(index)
# define btr_search_s_unlock_all(index)
# define btr_search_info_update(index, cursor)
# define btr_search_move_or_delete_hash_entries(new_block, block)
# define btr_search_update_hash_on_insert(cursor, ahi_latch)
# define btr_search_update_hash_on_delete(cursor)
# ifdef UNIV_DEBUG
# define btr_search_check_marked_free_index(block)
# endif /* UNIV_DEBUG */
#endif /* BTR_CUR_HASH_ADAPT */
#ifdef BTR_CUR_ADAPT
/** Create and initialize search info.
@param[in,out] heap heap where created
@return own: search info struct */
static inline btr_search_t* btr_search_info_create(mem_heap_t* heap)
MY_ATTRIBUTE((nonnull, warn_unused_result));
/** @return the search info of an index */
static inline btr_search_t* btr_search_get_info(dict_index_t* index)
{
return(index->search_info);
}
#endif /* BTR_CUR_ADAPT */
/** The search info struct in an index */
struct btr_search_t{
/* @{ The following fields are not protected by any latch.
Unfortunately, this means that they must be aligned to
the machine word, i.e., they cannot be turned into bit-fields. */
buf_block_t* root_guess;/*!< the root page frame when it was last time
fetched, or NULL */
#ifdef BTR_CUR_HASH_ADAPT
ulint hash_analysis; /*!< when this exceeds
BTR_SEARCH_HASH_ANALYSIS, the hash
analysis starts; this is reset if no
success noticed */
ibool last_hash_succ; /*!< TRUE if the last search would have
succeeded, or did succeed, using the hash
index; NOTE that the value here is not exact:
it is not calculated for every search, and the
calculation itself is not always accurate! */
ulint n_hash_potential;
/*!< number of consecutive searches
which would have succeeded, or did succeed,
using the hash index;
the range is 0 .. BTR_SEARCH_BUILD_LIMIT + 5 */
/* @} */
ulint ref_count; /*!< Number of blocks in this index tree
that have search index built
i.e. block->index points to this index.
Protected by search latch except
when during initialization in
btr_search_info_create(). */
/*---------------------- @{ */
uint16_t n_fields; /*!< recommended prefix length for hash search:
number of full fields */
uint16_t n_bytes; /*!< recommended prefix: number of bytes in
an incomplete field
@see BTR_PAGE_MAX_REC_SIZE */
bool left_side; /*!< true or false, depending on whether
the leftmost record of several records with
the same prefix should be indexed in the
hash index */
/*---------------------- @} */
#ifdef UNIV_SEARCH_PERF_STAT
ulint n_hash_succ; /*!< number of successful hash searches thus
far */
ulint n_hash_fail; /*!< number of failed hash searches */
ulint n_patt_succ; /*!< number of successful pattern searches thus
far */
ulint n_searches; /*!< number of searches */
#endif /* UNIV_SEARCH_PERF_STAT */
#endif /* BTR_CUR_HASH_ADAPT */
#ifdef UNIV_DEBUG
ulint magic_n; /*!< magic number @see BTR_SEARCH_MAGIC_N */
/** value of btr_search_t::magic_n, used in assertions */
# define BTR_SEARCH_MAGIC_N 1112765
#endif /* UNIV_DEBUG */
};
#ifdef BTR_CUR_HASH_ADAPT
/** The hash index system */
struct btr_search_sys_t
{
/** Partition of the hash table */
struct partition
{
/** latches protecting hash_table */
srw_spin_lock latch;
/** mapping of dtuple_fold() to rec_t* in buf_block_t::frame */
hash_table_t table;
/** memory heap for table */
mem_heap_t *heap;
#ifdef _MSC_VER
#pragma warning(push)
// nonstandard extension - zero sized array, if perfschema is not compiled
#pragma warning(disable : 4200)
#endif
char pad[(CPU_LEVEL1_DCACHE_LINESIZE - sizeof latch -
sizeof table - sizeof heap) &
(CPU_LEVEL1_DCACHE_LINESIZE - 1)];
#ifdef _MSC_VER
#pragma warning(pop)
#endif
void init()
{
memset((void*) this, 0, sizeof *this);
latch.SRW_LOCK_INIT(btr_search_latch_key);
}
void alloc(ulint hash_size)
{
table.create(hash_size);
heap= mem_heap_create_typed(std::min<ulong>(4096,
MEM_MAX_ALLOC_IN_BUF / 2
- MEM_BLOCK_HEADER_SIZE
- MEM_SPACE_NEEDED(0)),
MEM_HEAP_FOR_BTR_SEARCH);
}
void clear()
{
mem_heap_free(heap);
heap= nullptr;
ut_free(table.array);
}
void free()
{
latch.destroy();
if (heap)
clear();
}
};
/** Partitions of the adaptive hash index */
partition *parts;
/** Get an adaptive hash index partition */
partition *get_part(index_id_t id, ulint space_id) const
{
return parts + ut_fold_ulint_pair(ulint(id), space_id) % btr_ahi_parts;
}
/** Get an adaptive hash index partition */
partition *get_part(const dict_index_t &index) const
{
ut_ad(!index.table->space ||
index.table->space->id == index.table->space_id);
return get_part(ulint(index.id), index.table->space_id);
}
/** Get the search latch for the adaptive hash index partition */
srw_spin_lock *get_latch(const dict_index_t &index) const
{ return &get_part(index)->latch; }
/** Create and initialize at startup */
void create()
{
parts= static_cast<partition*>(ut_malloc(btr_ahi_parts * sizeof *parts,
mem_key_ahi));
for (ulong i= 0; i < btr_ahi_parts; ++i)
parts[i].init();
if (btr_search_enabled)
btr_search_enable();
}
void alloc(ulint hash_size)
{
hash_size/= btr_ahi_parts;
for (ulong i= 0; i < btr_ahi_parts; ++i)
parts[i].alloc(hash_size);
}
/** Clear when disabling the adaptive hash index */
void clear() { for (ulong i= 0; i < btr_ahi_parts; ++i) parts[i].clear(); }
/** Free at shutdown */
void free()
{
if (parts)
{
for (ulong i= 0; i < btr_ahi_parts; ++i)
parts[i].free();
ut_free(parts);
parts= nullptr;
}
}
};
/** The adaptive hash index */
extern btr_search_sys_t btr_search_sys;
/** @return number of leaf pages pointed to by the adaptive hash index */
TRANSACTIONAL_INLINE inline ulint dict_index_t::n_ahi_pages() const
{
if (!btr_search_enabled)
return 0;
srw_spin_lock *latch= &btr_search_sys.get_part(*this)->latch;
#if !defined NO_ELISION && !defined SUX_LOCK_GENERIC
if (xbegin())
{
if (latch->is_locked())
xabort();
ulint ref_count= search_info->ref_count;
xend();
return ref_count;
}
#endif
latch->rd_lock(SRW_LOCK_CALL);
ulint ref_count= search_info->ref_count;
latch->rd_unlock();
return ref_count;
}
#ifdef UNIV_SEARCH_PERF_STAT
/** Number of successful adaptive hash index lookups */
extern ulint btr_search_n_succ;
/** Number of failed adaptive hash index lookups */
extern ulint btr_search_n_hash_fail;
#endif /* UNIV_SEARCH_PERF_STAT */
/** After change in n_fields or n_bytes in info, this many rounds are waited
before starting the hash analysis again: this is to save CPU time when there
is no hope in building a hash index. */
#define BTR_SEARCH_HASH_ANALYSIS 17
/** Limit of consecutive searches for trying a search shortcut on the search
pattern */
#define BTR_SEARCH_ON_PATTERN_LIMIT 3
/** Limit of consecutive searches for trying a search shortcut using
the hash index */
#define BTR_SEARCH_ON_HASH_LIMIT 3
/** We do this many searches before trying to keep the search latch
over calls from MySQL. If we notice someone waiting for the latch, we
again set this much timeout. This is to reduce contention. */
#define BTR_SEA_TIMEOUT 10000
#endif /* BTR_CUR_HASH_ADAPT */
#include "btr0sea.inl"
#endif