mariadb

mirror of https://github.com/MariaDB/server.git synced 2025-01-16 12:02:42 +01:00

Author	SHA1	Message	Date
Monty	4832e54915	MDEV-20025: ADD_MONTHS() Oracle function Author: woqutech	2021-05-19 22:54:12 +02:00
Monty	cf93209c70	MDEV-20021 sql_mode="oracle" does not support MINUS set operator MINUS is mapped to EXCEPT One consequence of the patch is that MINUS becomes a reserved word in Oracle mode. Author: woqutech	2021-05-19 22:54:12 +02:00
Monty	be093c81a7	MDEV-24089 support oracle syntax: rownum The ROWNUM() function is for SELECT mapped to JOIN->accepted_rows, which is incremented for each accepted rows. For Filesort, update, insert, delete and load data, we map ROWNUM() to internal variables incremented when the table is changed. The connection between the row counter and Item_func_rownum is done in sql_select.cc::fix_items_after_optimize() and sql_insert.cc::fix_rownum_pointers() When ROWNUM() is used anywhere in query, the optimization to ignore ORDER BY in sub queries are disabled. This was done to get the following common Oracle query to work: select * from (select * from t1 order by a desc) as t where rownum() <= 2; MDEV-3926 "Wrong result with GROUP BY ... WITH ROLLUP" contains a discussion about this topic. LIMIT optimization is enabled when in a top level WHERE clause comparing ROWNUM() with a numerical constant using any of the following expressions: - ROWNUM() < # - ROWNUM() <= # - ROWNUM() = 1 ROWNUM() can be also be the right argument to the comparison function. LIMIT optimization is done in two cases: - For the current sub query when the ROWNUM comparison is done on the top level: SELECT * from t1 WHERE rownum() <= 2 AND t1.a > 0 - For an inner sub query, when the upper level has only a ROWNUM comparison in the WHERE clause: SELECT * from (select * from t1) as t WHERE rownum() <= 2 In Oracle mode, one can also use ROWNUM without parentheses. Other things: - Fixed bug where the optimizer tries to optimize away sub queries with RAND_TABLE_BIT set (non-deterministic queries). Now these sub queries will not be converted to joins. This bug fix was also needed to get rownum() working inside subqueries. - In remove_const() remove setting simple_order to FALSE if ROLLUP is USED. This code was disable a long time ago because of wrong assignment in the following code. Instead we set simple_order to false if RAND_TABLE_BIT was used in the SELECT list. This ensures that we don't delete ORDER BY if the result set is not deterministic, like in 'SELECT RAND() AS 'r' FROM t1 ORDER BY r'; - Updated parameters for Sort_param::init_for_filesort() to be able to provide filesort with information where the number of accepted rows should be stored - Reordered fields in class Filesort to optimize storage layout - Added new error messsage to tell that a function can't be used in HAVING - Added field 'with_rownum' to THD to mark that ROWNUM() is used in the query. Co-author: Oleksandr Byelkin <sanja@mariadb.com> LIMIT optimization for sub query	2021-05-19 22:54:11 +02:00
Rucha Deodhar	2fdb556e04	MDEV-8334: Rename utf8 to utf8mb3 This patch changes the main name of 3 byte character set from utf8 to utf8mb3. New old_mode UTF8_IS_UTF8MB3 is added and set TRUE by default, so that utf8 would mean utf8mb3. If not set, utf8 would mean utf8mb4.	2021-05-19 06:48:36 +02:00
Alexey Botchkov	502b769561	MDEV-17399 JSON_TABLE. Aftermerge fixes.	2021-04-21 10:21:48 +04:00
Sergei Petrunia	0fedaf2183	Update result for perfschema.start_server_low_digest_sql_length Post-rebase fix, JSON_TABLE touched the parser.	2021-04-21 10:21:45 +04:00
Oleksandr Byelkin	a3099a3b4a	MDEV-24312 master_host has 60 character limit, increase to 255 bytes Also increase user name up to 128. The work was started by Rucha Deodhar <rucha.deodhar@mariadb.com>, contains audit plugin fixes by Alexey Botchkov <holyfoot@askmonty.org>.	2021-04-20 16:36:56 +02:00
Sujatha	410832313e	MDEV-16437: merge 5.7 P_S replication instrumentation and tables Merge 'replication_applier_status' table. This table captures SQL_THREAD status.	2021-04-16 09:11:48 +05:30
Sujatha	767648cc2b	MDEV-16437: merge 5.7 P_S replication instrumentation and tables Merge 'replication_applier_configuration' table. This table captures SQL_THREAD configuration parameters.	2021-04-16 09:05:39 +05:30
Sujatha	70642871bc	MDEV-16437: merge 5.7 P_S replication instrumentation and tables Merge 'replication_applier_status_by_coordinator' table. This table captures SQL_THREAD status in case of both single threaded and multi threaded slave configuration. When multi_source replication is enabled this table will display each source specific SQL_THREAD status. Added new columns for: - LAST_SEEN_TRANSACTION - LAST_TRANS_RETRY_COUNT	2021-04-16 09:02:00 +05:30
Sujatha	2674365c8e	MDEV-16437: merge 5.7 P_S replication instrumentation and tables Merge 'replication_connection_configuration' table. Replaced following column: - AUTO_POSITION with USING_GTID Added new columns for: - IGNORE_SERVER_IDS - DO_DOMAIN_IDS - IGNORE_SERVER_IDS Removed following columns as they are not part of mariadb replication connection configuration: - NETWORK_INTERFACE - TLS_VERSION @sql/mysqld.cc Changed "master-retry-count" default value to 100000.	2021-04-16 08:54:19 +05:30
Sujatha	036ee61246	MDEV-20220: Merge 5.7 P_S replication table 'replication_applier_status_by_worker Step2: ===== Add two extra columns mentioned below. --------------------------------------------------------------------------- \|Column Name: \| Description: \| \|-------------------------------------------------------------------------\| \| \| \| \|WORKER_IDLE_TIME \| Total idle time in seconds that the worker \| \| \| thread has spent waiting for work from \| \| \| co-ordinator thread \| \| \| \| \|LAST_TRANS_RETRY_COUNT \| Total number of retries attempted by last \| \| \| transaction \| ---------------------------------------------------------------------------	2021-04-08 17:19:51 +05:30
Sujatha	94f1d0f84d	MDEV-20220: Merge 5.7 P_S replication table 'replication_applier_status_by_worker Step1: ===== Backport 'replication_applier_status_by_worker' from upstream. Iterate through rpl_parallel_thread_pool and display slave worker thread specific information as part of 'replication_applier_status_by_worker' table. --------------------------------------------------------------------------- \|Column Name: \| Description: \| \|-------------------------------------------------------------------------\| \| \| \| \|CHANNEL_NAME \| Name of replication channel through which the \| \| \| transaction is received. \| \| \| \| \|THREAD_ID \| Thread_Id as displayed in 'performance_schema. \| \| \| threads' table for thread with name \| \| \| 'thread/sql/rpl_parallel_thread' \| \| \| \| \| \| THREAD_ID will be NULL when worker threads are \| \| \| stopped due to an error/force stop \| \| \| \| \|SERVICE_STATE \| Thread is running or not \| \| \| \| \|LAST_SEEN_TRANSACTION \| Last GTID executed by worker \| \| \| \| \|LAST_ERROR_NUMBER \| Last Error that occured on a particular worker \| \| \| \| \|LAST_ERROR_MESSAGE \| Last error specific message \| \| \| \| \|LAST_ERROR_TIMESTAMP \| Time stamp of last error \| \| \| \| --------------------------------------------------------------------------- CHANNEL_NAME will be empty when the worker has not processed any transaction. Channel_name points to valid source channel_name when it is processing a transaction/event group.	2021-04-08 17:19:51 +05:30
Daniel Black	553ef1a78b	MDEV-13115: Implement SELECT SKIP LOCKED Adds an implementation for SELECT ... FOR UPDATE SKIP LOCKED / SELECT ... LOCK IN SHARED MODE SKIP LOCKED This is implemented only InnoDB at the moment, not in RockDB yet. This adds a new hander flag HA_CAN_SKIP_LOCKED than will be used when the storage engine advertises the flag. When a storage engine indicates this flag it will get TL_WRITE_SKIP_LOCKED and TL_READ_SKIP_LOCKED transaction types. The Lex structure has been updated to store both the FOR UPDATE/LOCK IN SHARE as well as the SKIP LOCKED so the SHOW CREATE VIEW implementation is simplier. "SELECT FOR UPDATE ... SKIP LOCKED" combined with CREATE TABLE AS or INSERT.. SELECT on the result set is not safe for STATEMENT based replication. MIXED replication will replicate this as row based events." Thanks to guidance from Facebook commit `193896c466` This helped verify basic test case, and components that need implementing (even though every part was implemented differently). Thanks Marko for guidance on simplier InnoDB implementation. Reviewers: Marko, Monty	2021-04-08 16:51:36 +10:00
Vladislav Vaintroub	aa2ff62082	MDEV-9077 Use sys schema in bootstrapping, incl. mtr	2021-03-18 08:02:48 +01:00
Varun Gupta	0540e50873	MDEV-25075: Ignorable index makes the resulting CREATE TABLE invalid - The patch itself - More changes to the parser - Fix by Sergei P to make the tests pass with --embedded	2021-03-17 13:45:45 +03:00
Daniel Black	1e7fed721b	mtr: perfschema.socket_{connect,instances_func} "Expect X" Match test output with what it is testing.	2021-03-05 08:25:53 +11:00
Rinat Ibragimov	b3abcf80a1	MDEV-6536: make --bind=hostname to listen on both IPv6 and IPv4 addresses Binding to a hostname now makes MariaDB server to listen on all addresses that hostname resolves to. Rebased to 10.6 by Daniel Black Closes: #1668	2021-03-05 08:25:52 +11:00
Marko Mäkelä	94b4578704	Merge 10.5 into 10.6	2021-02-17 19:39:05 +02:00
Sergei Golubchik	25d9d2e37f	Merge branch 'bb-10.4-release' into bb-10.5-release	2021-02-15 16:43:15 +01:00
Sergei Golubchik	00a313ecf3	Merge branch 'bb-10.3-release' into bb-10.4-release Note, the fix for "MDEV-23328 Server hang due to Galera lock conflict resolution" was null-merged. 10.4 version of the fix is coming up separately	2021-02-12 17:44:22 +01:00
Marko Mäkelä	b01d8e1a33	MDEV-20612: Replace lock_sys.mutex with lock_sys.latch For now, we will acquire the lock_sys.latch only in exclusive mode, that is, use it as a mutex. This is preparation for the next commit where we will introduce a less intrusive alternative, combining a shared lock_sys.latch with dict_table_t::lock_mutex or a mutex embedded in lock_sys.rec_hash, lock_sys.prdt_hash, or lock_sys.prdt_page_hash.	2021-02-11 14:52:10 +02:00
Sergei Golubchik	60ea09eae6	Merge branch '10.2' into 10.3	2021-02-01 13:49:33 +01:00
Marko Mäkelä	46234f03c8	Merge 10.5 into 10.6	2021-01-25 12:56:30 +02:00
Sergei Golubchik	6a1cb449fe	cleanup: remove slave background thread, use handle_manager thread instead	2021-01-24 11:35:55 +01:00
Marko Mäkelä	76b58c2af7	MDEV-24600 performance_schema.events_transactions_history_long.trx_id reports garbage The table performance_schema.events_transactions_history_long that was imported from MySQL 5.7.28 in commit `0ea717f51a` may report bogus trx_id values for InnoDB transactions. innobase_register_trx(): Pass trx->id to trans_register_ha(), even if it is 0. It is more appropriate to report NULL than some arbitrary value that has been constructed from the address of a transaction identifier.	2021-01-15 13:43:05 +02:00
Sergei Golubchik	a216672dab	MDEV-16341 Wrong length for USER columns in performance_schema tables use USERNAME_CHAR_LENGTH and HOSTNAME_LENGTH for perfschema USER and HOST columns	2021-01-11 21:54:48 +01:00
Marko Mäkelä	6268bdadf7	Merge 10.5 into 10.6	2021-01-04 10:52:32 +02:00
Aleksey Midenkov	ed5d675c75	MDEV-24364 Alter rename table does not remove PFS share Add missed PSI_CALL_drop_table_share().	2020-12-22 16:42:33 +03:00
Marko Mäkelä	ff5d306e29	MDEV-21452: Replace ib_mutex_t with mysql_mutex_t SHOW ENGINE INNODB MUTEX functionality is completely removed, as are the InnoDB latching order checks. We will enforce innodb_fatal_semaphore_wait_threshold only for dict_sys.mutex and lock_sys.mutex. dict_sys_t::mutex_lock(): A single entry point for dict_sys.mutex. lock_sys_t::mutex_lock(): A single entry point for lock_sys.mutex. FIXME: srv_sys should be removed altogether; it is duplicating tpool functionality. fil_crypt_threads_init(): To prevent SAFE_MUTEX warnings, we must not hold fil_system.mutex. fil_close_all_files(): To prevent SAFE_MUTEX warnings for fil_space_destroy_crypt_data(), we must not hold fil_system.mutex while invoking fil_space_free_low() on a detached tablespace.	2020-12-15 17:56:18 +02:00
Marko Mäkelä	03ca6495df	MDEV-24142: Replace InnoDB rw_lock_t with sux_lock InnoDB buffer pool block and index tree latches depend on a special kind of read-update-write lock that allows reentrant (recursive) acquisition of the 'update' and 'write' locks as well as an upgrade from 'update' lock to 'write' lock. The 'update' lock allows any number of reader locks from other threads, but no concurrent 'update' or 'write' lock. If there were no requirement to support an upgrade from 'update' to 'write', we could compose the lock out of two srw_lock (implemented as any type of native rw-lock, such as SRWLOCK on Microsoft Windows). Removing this requirement is very difficult, so in commit f7e7f487d4b06695f91f6fbeb0396b9d87fc7bbf we implemented an 'update' mode to our srw_lock. Re-entrant or recursive locking is mostly needed when writing or freeing BLOB pages, but also in crash recovery or when merging buffered changes to an index page. The re-entrancy allows us to attach a previously acquired page to a sub-mini-transaction that will be committed before whatever else is holding the page latch. The SUX lock supports Shared ('read'), Update, and eXclusive ('write') locking modes. The S latches are not re-entrant, but a single S latch may be acquired even if the thread already holds an U latch. The idea of the U latch is to allow a write of something that concurrent readers do not care about (such as the contents of BTR_SEG_LEAF, BTR_SEG_TOP and other page allocation metadata structures, or the MDEV-6076 PAGE_ROOT_AUTO_INC). (The PAGE_ROOT_AUTO_INC field is only updated when a dict_table_t for the table exists, and only read when a dict_table_t for the table is being added to dict_sys.) block_lock::u_lock_try(bool for_io=true) is used in buf_flush_page() to allow concurrent readers but no concurrent modifications while the page is being written to the data file. That latch will be released by buf_page_write_complete() in a different thread. Hence, we use the special lock owner value FOR_IO. The index_lock::u_lock() improves concurrency on operations that involve non-leaf index pages. The interface has been cleaned up a little. We will use x_lock_recursive() instead of x_lock() when we know that a lock is already held by the current thread. Similarly, a lock upgrade from U to X is only allowed via u_x_upgrade() or x_lock_upgraded() but not via x_lock(). We will disable the LatchDebug and sync_array interfaces to InnoDB rw-locks. The SEMAPHORES section of SHOW ENGINE INNODB STATUS output will no longer include any information about InnoDB rw-locks, only TTASEventMutex (cmake -DMUTEXTYPE=event) waits. This will make a part of the 'innotop' script dead code. The block_lock buf_block_t::lock will not be covered by any PERFORMANCE_SCHEMA instrumentation. SHOW ENGINE INNODB MUTEX and INFORMATION_SCHEMA.INNODB_MUTEXES will no longer output source code file names or line numbers. The dict_index_t::lock will be identified by index and table names, which should be much more useful. PERFORMANCE_SCHEMA is lumping information about all dict_index_t::lock together as event_name='wait/synch/sxlock/innodb/index_tree_rw_lock'. buf_page_free(): Remove the file,line parameters. The sux_lock will not store such diagnostic information. buf_block_dbg_add_level(): Define as empty macro, to be removed in a subsequent commit. Unless the build was configured with cmake -DPLUGIN_PERFSCHEMA=NO the index_lock dict_index_t::lock will be instrumented via PERFORMANCE_SCHEMA. Similar to commit `1669c8890c` we will distinguish lock waits by registering shared_lock,exclusive_lock events instead of try_shared_lock,try_exclusive_lock. Actual 'try' operations will not be instrumented at all. rw_lock_list: Remove. After MDEV-24167, this only covered buf_block_t::lock and dict_index_t::lock. We will output their information by traversing buf_pool or dict_sys.	2020-12-03 15:19:49 +02:00
Marko Mäkelä	3872e585f3	MDEV-24167: Stabilize perfschema.sxlock_func The extension of the test perfschema.sxlock_func in commit `1669c8890c` turned out to be unstable. Let us filter out purge_sys.latch (trx_purge_latch) from the output, because it might happen that the purge tasks will not be executed during the test execution.	2020-12-03 15:15:47 +02:00
Marko Mäkelä	1669c8890c	MDEV-24167 fixup: Improve the PERFORMANCE_SCHEMA instrumentation Let us try to avoid code bloat for the common case that performance_schema is disabled at runtime, and use ATTRIBUTE_NOINLINE member functions for instrumented latch acquisition. Also, let us distinguish lock waits from non-contended lock requests by using write_lock,read_lock for the requests that lead to waits, and try_write_lock,try_read_lock for the wait-free lock acquisitions. Actual 'try' operations are not being instrumented at all.	2020-12-03 09:55:53 +02:00
Marko Mäkelä	e28d9c15c3	MDEV-24167 fixup: Improve perfschema.sxlock_func test	2020-12-01 17:19:53 +02:00
Marko Mäkelä	edbde4a11f	MDEV-24167: Replace fil_space::latch We must avoid acquiring a latch while we are already holding one. The tablespace latch was being acquired recursively in some operations that allocate or free pages.	2020-11-24 15:43:12 +02:00
Marko Mäkelä	bdd88cfa34	MDEV-24167: Replace fts_cache_rw_lock, fts_cache_init_rw_lock with mutex fts_cache_t::init_lock: Replace with mutex. This was only acquired in exclusive mode. fts_cache_t:🔒 Replace with mutex. The only read-lock user was i_s_fts_index_cache_fill() for producing content for the view INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE.	2020-11-24 15:43:12 +02:00
Marko Mäkelä	1a1b7a6f16	MDEV-24167: Replace dict_operation_lock (dict_sys.latch)	2020-11-24 15:43:11 +02:00
Marko Mäkelä	06ef5509d0	MDEV-24167: Replace trx_purge_latch	2020-11-24 15:43:10 +02:00
Marko Mäkelä	63dd2a97e4	MDEV-24167: Replace trx_i_s_cache_lock	2020-11-24 15:43:09 +02:00
Marko Mäkelä	c561f9e6e8	MDEV-24167: Use lightweight srw_lock for btr_search_latch Many InnoDB rw-locks unnecessarily depend on the complex InnoDB rw_lock_t implementation that support the SX lock mode as well as recursive acquisition of X or SX locks. One of them is the bunch of adaptive hash index search latches, instrumented as btr_search_latch in PERFORMANCE_SCHEMA. Let us introduce a simpler lock for those in order to reduce overhead. srw_lock: A simple read-write lock that does not support recursion. On Microsoft Windows, this wraps SRWLOCK, only adding runtime overhead if PERFORMANCE_SCHEMA is enabled. On Linux (all architectures), this is implemented with std::atomic<uint32_t> and the futex system call. On other platforms, we will wrap mysql_rwlock_t with zero runtime overhead. The PERFORMANCE_SCHEMA instrumentation differs from InnoDB rw_lock_t in that we will only invoke PSI_RWLOCK_CALL(start_rwlock_wrwait) or PSI_RWLOCK_CALL(start_rwlock_rdwait) if there is an actual conflict.	2020-11-24 15:41:03 +02:00
Marko Mäkelä	a62a675fd2	Merge 10.5 into 10.6	2020-11-23 17:57:58 +02:00
Marko Mäkelä	1e5d989d2a	MDEV-24167: Remove PFS instrumentation of buf_block_t We always defined PFS_SKIP_BUFFER_MUTEX_RWLOCK, that is, the latches of the buffer pool blocks were never instrumented in PERFORMANCE_SCHEMA. For some reason, the debug_latch (which enforce proper usage of buffer-fixing in debug builds) was instrumented.	2020-11-20 08:55:41 +02:00
Marko Mäkelä	fabdad6857	Merge 10.4 into 10.5	2020-11-17 18:15:13 +02:00
Marko Mäkelä	bbf0b55c17	Work around MDEV-24232: Skip perfschema.nesting if WITH_WSREP=OFF	2020-11-17 18:13:47 +02:00
Oleksandr Byelkin	10aa576483	Merge branch '10.5' into 10.6	2020-11-14 20:05:35 +01:00
Marko Mäkelä	d7a5824899	Merge 10.4 into 10.5	2020-11-13 21:54:21 +02:00
Marko Mäkelä	09a1f0075a	Merge 10.5 into 10.6	2020-11-02 12:49:19 +02:00
Marko Mäkelä	898521e2dd	Merge 10.4 into 10.5	2020-10-30 11:15:30 +02:00
Sergei Golubchik	e764d11829	fix occasisonal test failures: SELECT without ORDER BY	2020-10-24 11:15:51 +02:00
Marko Mäkelä	7cffb5f6e8	MDEV-23399: Performance regression with write workloads The buffer pool refactoring in MDEV-15053 and MDEV-22871 shifted the performance bottleneck to the page flushing. The configuration parameters will be changed as follows: innodb_lru_flush_size=32 (new: how many pages to flush on LRU eviction) innodb_lru_scan_depth=1536 (old: 1024) innodb_max_dirty_pages_pct=90 (old: 75) innodb_max_dirty_pages_pct_lwm=75 (old: 0) Note: The parameter innodb_lru_scan_depth will only affect LRU eviction of buffer pool pages when a new page is being allocated. The page cleaner thread will no longer evict any pages. It used to guarantee that some pages will remain free in the buffer pool. Now, we perform that eviction 'on demand' in buf_LRU_get_free_block(). The parameter innodb_lru_scan_depth(srv_LRU_scan_depth) is used as follows: * When the buffer pool is being shrunk in buf_pool_t::withdraw_blocks() * As a buf_pool.free limit in buf_LRU_list_batch() for terminating the flushing that is initiated e.g., by buf_LRU_get_free_block() The parameter also used to serve as an initial limit for unzip_LRU eviction (evicting uncompressed page frames while retaining ROW_FORMAT=COMPRESSED pages), but now we will use a hard-coded limit of 100 or unlimited for invoking buf_LRU_scan_and_free_block(). The status variables will be changed as follows: innodb_buffer_pool_pages_flushed: This includes also the count of innodb_buffer_pool_pages_LRU_flushed and should work reliably, updated one by one in buf_flush_page() to give more real-time statistics. The function buf_flush_stats(), which we are removing, was not called in every code path. For both counters, we will use regular variables that are incremented in a critical section of buf_pool.mutex. Note that show_innodb_vars() directly links to the variables, and reads of the counters will not be protected by buf_pool.mutex, so you cannot get a consistent snapshot of both variables. The following INFORMATION_SCHEMA.INNODB_METRICS counters will be removed, because the page cleaner no longer deals with writing or evicting least recently used pages, and because the single-page writes have been removed: * buffer_LRU_batch_flush_avg_time_slot * buffer_LRU_batch_flush_avg_time_thread * buffer_LRU_batch_flush_avg_time_est * buffer_LRU_batch_flush_avg_pass * buffer_LRU_single_flush_scanned * buffer_LRU_single_flush_num_scan * buffer_LRU_single_flush_scanned_per_call When moving to a single buffer pool instance in MDEV-15058, we missed some opportunity to simplify the buf_flush_page_cleaner thread. It was unnecessarily using a mutex and some complex data structures, even though we always have a single page cleaner thread. Furthermore, the buf_flush_page_cleaner thread had separate 'recovery' and 'shutdown' modes where it was waiting to be triggered by some other thread, adding unnecessary latency and potential for hangs in relatively rarely executed startup or shutdown code. The page cleaner was also running two kinds of batches in an interleaved fashion: "LRU flush" (writing out some least recently used pages and evicting them on write completion) and the normal batches that aim to increase the MIN(oldest_modification) in the buffer pool, to help the log checkpoint advance. The buf_pool.flush_list flushing was being blocked by buf_block_t::lock for no good reason. Furthermore, if the FIL_PAGE_LSN of a page is ahead of log_sys.get_flushed_lsn(), that is, what has been persistently written to the redo log, we would trigger a log flush and then resume the page flushing. This would unnecessarily limit the performance of the page cleaner thread and trigger the infamous messages "InnoDB: page_cleaner: 1000ms intended loop took 4450ms. The settings might not be optimal" that were suppressed in commit `d1ab89037a` unless log_warnings>2. Our revised algorithm will make log_sys.get_flushed_lsn() advance at the start of buf_flush_lists(), and then execute a 'best effort' to write out all pages. The flush batches will skip pages that were modified since the log was written, or are are currently exclusively locked. The MDEV-13670 message "page_cleaner: 1000ms intended loop took" message will be removed, because by design, the buf_flush_page_cleaner() should not be blocked during a batch for extended periods of time. We will remove the single-page flushing altogether. Related to this, the debug parameter innodb_doublewrite_batch_size will be removed, because all of the doublewrite buffer will be used for flushing batches. If a page needs to be evicted from the buffer pool and all 100 least recently used pages in the buffer pool have unflushed changes, buf_LRU_get_free_block() will execute buf_flush_lists() to write out and evict innodb_lru_flush_size pages. At most one thread will execute buf_flush_lists() in buf_LRU_get_free_block(); other threads will wait for that LRU flushing batch to finish. To improve concurrency, we will replace the InnoDB ib_mutex_t and os_event_t native mutexes and condition variables in this area of code. Most notably, this means that the buffer pool mutex (buf_pool.mutex) is no longer instrumented via any InnoDB interfaces. It will continue to be instrumented via PERFORMANCE_SCHEMA. For now, both buf_pool.flush_list_mutex and buf_pool.mutex will be declared with MY_MUTEX_INIT_FAST (PTHREAD_MUTEX_ADAPTIVE_NP). The critical sections of buf_pool.flush_list_mutex should be shorter than those for buf_pool.mutex, because in the worst case, they cover a linear scan of buf_pool.flush_list, while the worst case of a critical section of buf_pool.mutex covers a linear scan of the potentially much longer buf_pool.LRU list. mysql_mutex_is_owner(), safe_mutex_is_owner(): New predicate, usable with SAFE_MUTEX. Some InnoDB debug assertions need this predicate instead of mysql_mutex_assert_owner() or mysql_mutex_assert_not_owner(). buf_pool_t::n_flush_LRU, buf_pool_t::n_flush_list: Replaces buf_pool_t::init_flush[] and buf_pool_t::n_flush[]. The number of active flush operations. buf_pool_t::mutex, buf_pool_t::flush_list_mutex: Use mysql_mutex_t instead of ib_mutex_t, to have native mutexes with PERFORMANCE_SCHEMA and SAFE_MUTEX instrumentation. buf_pool_t::done_flush_LRU: Condition variable for !n_flush_LRU. buf_pool_t::done_flush_list: Condition variable for !n_flush_list. buf_pool_t::do_flush_list: Condition variable to wake up the buf_flush_page_cleaner when a log checkpoint needs to be written or the server is being shut down. Replaces buf_flush_event. We will keep using timed waits (the page cleaner thread will wake _at least_ once per second), because the calculations for innodb_adaptive_flushing depend on fixed time intervals. buf_dblwr: Allocate statically, and move all code to member functions. Use a native mutex and condition variable. Remove code to deal with single-page flushing. buf_dblwr_check_block(): Make the check debug-only. We were spending a significant amount of execution time in page_simple_validate_new(). flush_counters_t::unzip_LRU_evicted: Remove. IORequest: Make more members const. FIXME: m_fil_node should be removed. buf_flush_sync_lsn: Protect by std::atomic, not page_cleaner.mutex (which we are removing). page_cleaner_slot_t, page_cleaner_t: Remove many redundant members. pc_request_flush_slot(): Replaces pc_request() and pc_flush_slot(). recv_writer_thread: Remove. Recovery works just fine without it, if we simply invoke buf_flush_sync() at the end of each batch in recv_sys_t::apply(). recv_recovery_from_checkpoint_finish(): Remove. We can simply call recv_sys.debug_free() directly. srv_started_redo: Replaces srv_start_state. SRV_SHUTDOWN_FLUSH_PHASE: Remove. logs_empty_and_mark_files_at_shutdown() can communicate with the normal page cleaner loop via the new function flush_buffer_pool(). buf_flush_remove(): Assert that the calling thread is holding buf_pool.flush_list_mutex. This removes unnecessary mutex operations from buf_flush_remove_pages() and buf_flush_dirty_pages(), which replace buf_LRU_flush_or_remove_pages(). buf_flush_lists(): Renamed from buf_flush_batch(), with simplified interface. Return the number of flushed pages. Clarified comments and renamed min_n to max_n. Identify LRU batch by lsn=0. Merge all the functions buf_flush_start(), buf_flush_batch(), buf_flush_end() directly to this function, which was their only caller, and remove 2 unnecessary buf_pool.mutex release/re-acquisition that we used to perform around the buf_flush_batch() call. At the start, if not all log has been durably written, wait for a background task to do it, or start a new task to do it. This allows the log write to run concurrently with our page flushing batch. Any pages that were skipped due to too recent FIL_PAGE_LSN or due to them being latched by a writer should be flushed during the next batch, unless there are further modifications to those pages. It is possible that a page that we must flush due to small oldest_modification also carries a recent FIL_PAGE_LSN or is being constantly modified. In the worst case, all writers would then end up waiting in log_free_check() to allow the flushing and the checkpoint to complete. buf_do_flush_list_batch(): Clarify comments, and rename min_n to max_n. Cache the last looked up tablespace. If neighbor flushing is not applicable, invoke buf_flush_page() directly, avoiding a page lookup in between. buf_flush_space(): Auxiliary function to look up a tablespace for page flushing. buf_flush_page(): Defer the computation of space->full_crc32(). Never call log_write_up_to(), but instead skip persistent pages whose latest modification (FIL_PAGE_LSN) is newer than the redo log. Also skip pages on which we cannot acquire a shared latch without waiting. buf_flush_try_neighbors(): Do not bother checking buf_fix_count because buf_flush_page() will no longer wait for the page latch. Take the tablespace as a parameter, and only execute this function when innodb_flush_neighbors>0. Avoid repeated calls of page_id_t::fold(). buf_flush_relocate_on_flush_list(): Declare as cold, and push down a condition from the callers. buf_flush_check_neighbor(): Take id.fold() as a parameter. buf_flush_sync(): Ensure that the buf_pool.flush_list is empty, because the flushing batch will skip pages whose modifications have not yet been written to the log or were latched for modification. buf_free_from_unzip_LRU_list_batch(): Remove redundant local variables. buf_flush_LRU_list_batch(): Let the caller buf_do_LRU_batch() initialize the counters, and report n->evicted. Cache the last looked up tablespace. If neighbor flushing is not applicable, invoke buf_flush_page() directly, avoiding a page lookup in between. buf_do_LRU_batch(): Return the number of pages flushed. buf_LRU_free_page(): Only release and re-acquire buf_pool.mutex if adaptive hash index entries are pointing to the block. buf_LRU_get_free_block(): Do not wake up the page cleaner, because it will no longer perform any useful work for us, and we do not want it to compete for I/O while buf_flush_lists(innodb_lru_flush_size, 0) writes out and evicts at most innodb_lru_flush_size pages. (The function buf_do_LRU_batch() may complete after writing fewer pages if more than innodb_lru_scan_depth pages end up in buf_pool.free list.) Eliminate some mutex release-acquire cycles, and wait for the LRU flush batch to complete before rescanning. buf_LRU_check_size_of_non_data_objects(): Simplify the code. buf_page_write_complete(): Remove the parameter evict, and always evict pages that were part of an LRU flush. buf_page_create(): Take a pre-allocated page as a parameter. buf_pool_t::free_block(): Free a pre-allocated block. recv_sys_t::recover_low(), recv_sys_t::apply(): Preallocate the block while not holding recv_sys.mutex. During page allocation, we may initiate a page flush, which in turn may initiate a log flush, which would require acquiring log_sys.mutex, which should always be acquired before recv_sys.mutex in order to avoid deadlocks. Therefore, we must not be holding recv_sys.mutex while allocating a buffer pool block. BtrBulk::logFreeCheck(): Skip a redundant condition. row_undo_step(): Do not invoke srv_inc_activity_count() for every row that is being rolled back. It should suffice to invoke the function in trx_flush_log_if_needed() during trx_t::commit_in_memory() when the rollback completes. sync_check_enable(): Remove. We will enable innodb_sync_debug from the very beginning. Reviewed by: Vladislav Vaintroub	2020-10-15 17:04:56 +03:00

1 2 3 4 5 ...

584 commits