Analysis: Problem is that page is encrypted but encryption information
on page 0 has already being changed.
Fix: If page header contains key_version != 0 and even if based on
current encryption information tablespace is not encrypted we
need to check is page corrupted. If it is not, then we know that
page is not encrypted. If page is corrupted, we need to try to
decrypt it and then compare the stored and calculated checksums
to see is page corrupted or not.
Two problems:
(1) When pushing warning to sql-layer we need to check that thd != NULL
to avoid NULL-pointer reference.
(2) At tablespace key rotation if used key_id is not found from
encryption plugin tablespace should not be rotated.
MDEV-10394: Innodb system table space corrupted
Analysis: After we have read the page in buf_page_io_complete try to
find if the page is encrypted or corrupted. Encryption was determined
by reading FIL_PAGE_FILE_FLUSH_LSN_OR_KEY_VERSION field from FIL-header
as a key_version. However, this field is not always zero even when
encryption is not used. Thus, incorrect key_version could lead situation where
decryption is tried to page that is not encrypted.
Fix: We still read key_version information from FIL_PAGE_FILE_FLUSH_LSN_OR_KEY_VERSION
field but also check if tablespace has encryption information before trying
encrypt the page.
No point to issue RELEASE memory barrier in os_thread_create_func(): thread
creation is full memory barrier.
No point to issue os_wmb in rw_lock_set_waiter_flag() and
rw_lock_reset_waiter_flag(): this is deadcode and it is unlikely operational
anyway. If atomic builtins are unavailable - memory barriers are most certainly
unavailable too.
RELEASE memory barrier is definitely abused in buf_pool_withdraw_blocks(): most
probably it was supposed to commit volatile variable update, which is not what
memory barriers actually do. To operate properly it needs corresponding ACQUIRE
barrier without an associated atomic operation anyway.
ACQUIRE memory barrier is definitely abused in log_write_up_to(): most probably
it was supposed to synchronize dirty read of log_sys->write_lsn. To operate
properly it needs corresponding RELEASE barrier without an associated atomic
operation anyway.
Removed a bunch of ACQUIRE memory barriers from InnoDB rwlocks. They're
meaningless without corresponding RELEASE memory barriers.
Valid usage example of memory barriers without an associated atomic operation:
http://en.cppreference.com/w/cpp/atomic/atomic_thread_fence
Replaced InnoDB atomic operations with server atomic operations.
Moved INNODB_RW_LOCKS_USE_ATOMICS - it is always defined (code won't compile
otherwise).
NOTE: InnoDB uses thread identifiers as a target for atomic operations.
Thread identifiers should be considered opaque: any attempt to use a
thread ID other than in pthreads calls is nonportable and can lead to
unspecified results.
Using numa_all_nodes_ptr was excessively optimistic. Due to
constraints in systemd, containers or otherwise mysqld could of been
limited to a smaller set of cpus. Use the numa_get_mems_allowed
library function to see what we can interleave between before doing
so. The alternative is to fail interleaving overall.
Signed-off-by: Daniel Black <daniel.black@au.ibm.com>
Contains also:
MDEV-10549 mysqld: sql/handler.cc:2692: int handler::ha_index_first(uchar*): Assertion `table_share->tmp_table != NO_TMP_TABLE || m_lock_type != 2' failed. (branch bb-10.2-jan)
Unlike MySQL, InnoDB still uses THR_LOCK in MariaDB
MDEV-10548 Some of the debug sync waits do not work with InnoDB 5.7 (branch bb-10.2-jan)
enable tests that were fixed in MDEV-10549
MDEV-10548 Some of the debug sync waits do not work with InnoDB 5.7 (branch bb-10.2-jan)
fix main.innodb_mysql_sync - re-enable online alter for partitioned innodb tables
Contains also
MDEV-10547: Test multi_update_innodb fails with InnoDB 5.7
The failure happened because 5.7 has changed the signature of
the bool handler::primary_key_is_clustered() const
virtual function ("const" was added). InnoDB was using the old
signature which caused the function not to be used.
MDEV-10550: Parallel replication lock waits/deadlock handling does not work with InnoDB 5.7
Fixed mutexing problem on lock_trx_handle_wait. Note that
rpl_parallel and rpl_optimistic_parallel tests still
fail.
MDEV-10156 : Group commit tests fail on 10.2 InnoDB (branch bb-10.2-jan)
Reason: incorrect merge
MDEV-10550: Parallel replication can't sync with master in InnoDB 5.7 (branch bb-10.2-jan)
Reason: incorrect merge
Analysis: When pages in doublewrite buffer are analyzed compressed
pages do not have correct checksum.
Fix: Decompress page before checksum is compared. If decompression
fails we still check checksum and corrupted pages are found.
If decompression succeeds, page now contains the original
checksum.
There was two problems. Firstly, if page in ibuf is encrypted but
decrypt failed we should not allow InnoDB to start because
this means that system tablespace is encrypted and not usable.
Secondly, if page decrypt is detected we should return false
from buf_page_decrypt_after_read.
we can see from the hang stacktrace, srv_monitor_thread is blocked
when getting log_sys::mutex, so that sync_arr_wake_threads_if_sema_free
cannot get a change to break the mutex deadlock.
The fix is simply removing any mutex wait in srv_monitor_thread.
Patch is reviewed by Sunny over IM.
Backport pull request #125 from grooverdan/MDEV-8923_innodb_buffer_pool_dump_pct to 10.0
WL#6504 InnoDB buffer pool dump/load enchantments
This patch consists of two parts:
1. Dump only the hottest N% of the buffer pool(s)
2. Prevent hogging the server duing BP load
From MySQL - commit b409342c43ce2edb68807100a77001367c7e6b8e
Add testcases for innodb_buffer_pool_dump_pct_basic.
Part of the code authored by Daniel Black
WL#6504 InnoDB buffer pool dump/load enchantments
This patch consists of two parts:
1. Dump only the hottest N% of the buffer pool(s)
2. Prevent hogging the server duing BP load
From MySQL - commit b409342c43ce2edb68807100a77001367c7e6b8e
Analysis: When a page is read from encrypted table and page can't be
decrypted because of bad key (or incorrect encryption algorithm or
method) page was incorrectly left on buffer pool.
Fix: Remove page from buffer pool and from pending IO.
Analysis: Problem was that in fil_read_first_page we do find that
table has encryption information and that encryption service
or used key_id is not available. But, then we just printed
fatal error message that causes above assertion.
Fix: When we open single table tablespace if it has encryption
information (crypt_data) store this crypt data to the table
structure. When we open a table and we find out that tablespace
is not available, check has table a encryption information
and from there is encryption service or used key_id is not available.
If it is, add additional warning for SQL-layer.
MDEV-8409: Changing file-key-management-encryption-algorithm causes crash and no real info why
Analysis: Both bugs has two different error cases. Firstly, at startup
when server reads latest checkpoint but requested key_version,
key management plugin or encryption algorithm or method is not found
leading corrupted log entry. Secondly, similarly when reading system
tablespace if requested key_version, key management plugin or encryption
algorithm or method is not found leading buffer pool page corruption.
Fix: Firsly, when reading checkpoint at startup check if the log record
may be encrypted and if we find that it could be encrypted, print error
message and do not start server. Secondly, if page is buffer pool seems
corrupted but we find out that there is crypt_info, print additional
error message before asserting.
Added new dynamic configuration variable innodb_buf_dump_status_frequency
to configure how often buffer pool dump status is printed in the logs.
A number between [0, 100] that tells how oftern buffer pool dump status
in percentages should be printed. E.g. 10 means that buffer pool dump
status is printed when every 10% of number of buffer pool pages are
dumped. Default is 0 (only start and end status is printed).
Analysis: Problem was that actual payload size (page size) after compression
was handled incorrectly on encryption. Additionally, some of the variables
were not initialized.
Fixed by encrypting/decrypting only the actual compressed page size.
Analysis: Problem is that there is not enough temporary buffer slots
for pending IO requests.
Fixed by allocating same amount of temporary buffer slots as there
are max pending IO requests.
Analysis: Problem is that both encrypted tables and compressed tables use
FIL header offset FIL_PAGE_FILE_FLUSH_LSN_OR_KEY_VERSION to store
required metadata. Furhermore, for only compressed tables currently
code skips compression.
Fixes:
- Only encrypted pages store key_version to FIL header offset FIL_PAGE_FILE_FLUSH_LSN_OR_KEY_VERSION,
no need to fix
- Only compressed pages store compression algorithm to FIL header offset FIL_PAGE_FILE_FLUSH_LSN_OR_KEY_VERSION,
no need to fix as they have different page type FIL_PAGE_PAGE_COMPRESSED
- Compressed and encrypted pages now use a new page type FIL_PAGE_PAGE_COMPRESSED_ENCRYPTED and
key_version is stored on FIL header offset FIL_PAGE_FILE_FLUSH_LSN_OR_KEY_VERSION and compression
method is stored after FIL header similar way as compressed size, so that first
FIL_PAGE_COMPRESSED_SIZE is stored followed by FIL_PAGE_COMPRESSION_METHOD
- Fix buf_page_encrypt_before_write function to really compress pages if compression is enabled
- Fix buf_page_decrypt_after_read function to really decompress pages if compression is used
- Small style fixes
Make sure that when we publish the crypt_data we access the
memory cache of the tablespace crypt_data. Make sure that
crypt_data is stored whenever it is really needed.
All this is not yet enough in my opinion because:
sql/encryption.cc has DBUG_ASSERT(scheme->type == 1) i.e.
crypt_data->type == CRYPT_SCHEME_1
However, for InnoDB point of view we have global crypt_data
for every tablespace. When we change variables on crypt_data
we take mutex. However, when we use crypt_data for
encryption/decryption we use pointer to this global
structure and no mutex to protect against changes on
crypt_data.
Tablespace encryption starts in fil_crypt_start_encrypting_space
from crypt_data that has crypt_data->type = CRYPT_SCHEME_UNENCRYPTED
and later we write page 0 CRYPT_SCHEME_1 and finally whe publish
that to memory cache.
Analysis: Problem was that tablespaces not encrypted might not have
crypt_data stored on disk.
Fixed by always creating crypt_data to memory cache of the tablespace.
MDEV-8138: strange results from encrypt-and-grep test
Analysis: crypt_data->type is not updated correctly on memory
cache. This caused problem with state tranfer on
encrypted => unencrypted => encrypted.
Fixed by updating memory cache of crypt_data->type correctly based on
current srv_encrypt_tables value to either CRYPT_SCHEME_1 or
CRYPT_SCHEME_UNENCRYPTED.
Analysis: Problem was that we did create crypt data for encrypted table but
this new crypt data was not written to page 0. Instead a default crypt data
was written to page 0 at table creation.
Fixed by explicitly writing new crypt data to page 0 after successfull
table creation.
With changes:
* update tests to pass (new encryption/encryption_key_id syntax).
* not merged the code that makes engine aware of the encryption mode
(CRYPT_SCHEME_1_CBC, CRYPT_SCHEME_1_CTR, storing it on disk, etc),
because now the encryption plugin is handling it.
* compression+encryption did not work in either branch before the
merge - and it does not work after the merge. it might be more
broken after the merge though - some of that code was not merged.
* page checksumming code was not moved (moving of page checksumming
from fil_space_encrypt() to fil_space_decrypt was not merged).
* restored deleted lines in buf_page_get_frame(), otherwise
innodb_scrub test failed.
Step 3:
-- Make encrytion_algorithm changeable by SUPER
-- Remove AES_ECB method from encryption_algorithms
-- Support AES method change by storing used method on InnoDB/XtraDB objects
-- Store used AES method to crypt_data as different crypt types
-- Store used AES method to redo/undo logs and checkpoint
-- Store used AES method on every encrypted page after key_version
-- Add test
Step 2:
-- Introduce temporal memory array to buffer pool where to allocate
temporary memory for encryption/compression
-- Rename PAGE_ENCRYPTION -> ENCRYPTION
-- Rename PAGE_ENCRYPTION_KEY -> ENCRYPTION_KEY
-- Rename innodb_default_page_encryption_key -> innodb_default_encryption_key
-- Allow enable/disable encryption for tables by changing
ENCRYPTION to enum having values DEFAULT, ON, OFF
-- In create table store crypt_data if ENCRYPTION is ON or OFF
-- Do not crypt tablespaces having ENCRYPTION=OFF
-- Store encryption mode to crypt_data and redo-log
Step 1:
-- Remove page encryption from dictionary (per table
encryption will be handled by storing crypt_data to page 0)
-- Remove encryption/compression from os0file and all functions
before that (compression will be added to buf0buf.cc)
-- Use same CRYPT_SCHEME_1 for all encryption methods
-- Do some code cleanups to confort InnoDB coding style
TO USE A SECOND WATCH PAGE PER INSTANCE
Description:
BUF_POOL_WATCH_SIZE is also initialized to number of purge threads.
so BUF_POOL_WATCH_SIZE will never be lesser than number of purge threads.
From the code, there is no scope for purge thread to skip buf_pool_watch_unset.
So there can be at most one buffer pool watch active per purge thread.
In other words, there is no chance for purge thread instance to hold a watch
when setting another watch.
Solution:
Adding code comments to clarify the issue.
Reviewed-by: Marko Mäkelä <marko.makela@oracle.com>
Approved via Bug page.
This patch ports the work that facebook has performed
to make innochecksum handle compressed tables.
the basic idea is to use actual innodb-code to perform
checksum verification rather than duplicating in innochecksum.cc.
to make this work, innodb code has been annotated with
lots of #ifndef UNIV_INNOCHECKSUM so that it can be
compiled outside of storage/innobase.
A new testcase is also added that verifies that innochecksum
works on compressed/non-compressed tables.
Merged from commit fabc79d2ea976c4ff5b79bfe913e6bc03ef69d42
from https://code.google.com/p/google-mysql/
The actual steps to produce this patch are:
take innochecksum from 5.6.14
apply changes in innodb from facebook patches needed to make innochecksum compile
apply changes in innochecksum from facebook patches
add handcrafted testcase
The referenced facebook patches used are:
91e25120e7847fe76ea51135628a5a4dbf7c240c
in file buf0mtflu.cc line 439
Analysis: At shutdown multi-threaded flush sends a exit work
items to all mtflush threads. We wait until the work queue is
empty. However, as we did not hold the mutex, some other thread
could also put work-items to work queue.
Fix: Take mutex before adding exit work items to work queue and
wait until all work-items are really processed. Release
mutex after we have marked that multi-threaded flush is not
anymore active.
Fix test failure on innodb_bug12902967 caused by unnecessary
info output on xtradb/buf/buf0mtflush.cc.
Do not try to enable atomic writes if the file type
is not OS_DATA_FILE. Atomic writes are unnecessary
for log files. If we try to enable atomic writes
to log writes that are stored to media supporting
atomic writes we will end up problems later.
Analysis: InnoDB error monitor is responsible to call every second
sync_arr_wake_threads_if_sema_free() to wake up possible hanging
threads if they are missed in mutex_signal_object. This is not
possible if error monitor itself is on mutex/semaphore wait. We
should avoid all unnecessary mutex/semaphore waits on error monitor.
Currently error monitor calls function buf_flush_stat_update()
that calls log_get_lsn() function and there we will try to get
log_sys mutex. Better, solution for error monitor is that in
buf_flush_stat_update() we will try to get lsn with
mutex_enter_nowait() and if we did not get mutex do not update
the stats.
Fix: Use log_get_lsn_nowait() function on buf_flush_stat_update()
function. If returned lsn is 0, we do not update flush stats.
log_get_lsn_nowait() will use mutex_enter_nowait() and if
we get mutex we return a correct lsn if not we return 0.
Analysis: InnoDB error monitor is responsible to call every second
sync_arr_wake_threads_if_sema_free() to wake up possible hanging
threads if they are missed in mutex_signal_object. This is not
possible if error monitor itself is on mutex/semaphore wait. We
should avoid all unnecessary mutex/semaphore waits on error monitor.
Currently error monitor calls function buf_flush_stat_update()
that calls log_get_lsn() function and there we will try to get
log_sys mutex. Better, solution for error monitor is that in
buf_flush_stat_update() we will try to get lsn with
mutex_enter_nowait() and if we did not get mutex do not update
the stats.
Fix: Use log_get_lsn_nowait() function on buf_flush_stat_update()
function. If returned lsn is 0, we do not update flush stats.
log_get_lsn_nowait() will use mutex_enter_nowait() and if
we get mutex we return a correct lsn if not we return 0.
Merged Facebook commit 617aef9f911d825e9053f3d611d0389e02031225
authored by Inaam Rana to InnoDB storage engine (not XtraDB)
from https://github.com/facebook/mysql-5.6
WL#7047 - Optimize buffer pool list scans and related batch processing
Reduce excessive scanning of pages when doing flush list batches. The
fix is to introduce the concept of "Hazard Pointer", this reduces the
time complexity of the scan from O(n*n) to O.
The concept of hazard pointer is reversed in this work. Academically
hazard pointer is a pointer that the thread working on it will declar
such and as long as that thread is not done no other thread is allowe
do anything with it.
In this WL we declare the pointer as a hazard pointer and then if any
thread attempts to work on it, it is allowed to do so but it has to a
the hazard pointer to the next valid value. We use hazard pointer sol
reverse traversal of lists within a buffer pool instance.
Add an event to control the background flush thread. The background f
thread wait has been converted to an os event timed wait so that it c
signalled by threads that want to kick start a background flush when
buffer pool is running low on free/dirty pages.
Merge Facebook commit 4f3e0343fd2ac3fc7311d0ec9739a8f668274f0d
authored by Steaphan Greene from https://github.com/facebook/mysql-5.6
Adds innodb_idle_flush_pct to enable tuning of the page flushing rate
when the system is relatively idle. We care about this, since doing
extra unnecessary flash writes shortens the lifespan of the flash.
Merged Facebook commit ecff018632c6db49bad73d9233c3cdc9f41430e9
authored by Steaphan Greene from https://github.com/facebook/mysql-5.6
This change is to fix: http://bugs.mysql.com/62534
This makes innodb_max_dirty_pages_pct a double with min,default,max values
0.001, 75, 99.999.
This also makes innodb_max_dirty_pages_pct_lwm and adaptive_flushing_lwm
doubles, as these sysvars are inter-dependent.
Added more to the BUFFER POOL AND MEMORY section of SHOW INNODB STATUS:
Percent pages dirty: X.X
This is all n_dirty_pages / used_pages
Percent all pages dirty: X.X
This is all n_dirty_pages / all-pages
Max dirty pages percent: X.X
This is innodb_max_dirty_pages_pct
Also changed all of buf from 2 to 3 digits of precision (%.2f -> %.3f).
compressed pages
buf_flush_LRU() returns the number of pages processed. There are
two types of processing that can happen. A page can get evicted or
a page can get flushed. These two numbers are quite distinct and
should not be mixed.
Merge Facebook commit 926a077b14b73c14094de7fc7aa913241b801b4d
authored by Inaam Rana from https://github.com/facebook/mysql-5.6.
This is fix for upstream bugs
http://bugs.mysql.com/bug.php?id=71988http://bugs.mysql.com/bug.php?id=70500
page_cleaner should work whether or not there is server activity.
Its iterations become a noop when there is no work to do but we
should not tie it to the server activity.
The page_cleaner thread does spurious background flushing
because of conditional sleep between iterations. The solution
is not to make sleep dependent on server activity etc.
Merged Facebooks commit 6e06bbfa315ffb97d713dd6e672d6054036ddc21
authored by Inaam Rana from https://github.com/facebook/mysql-5.6.
Fixes MySQL bug http://bugs.mysql.com/bug.php?id=72123
lock_timeout thread works in a tight loop waking up every second
and checking for lock_wait_timeout. In addition, when a mysql
thread is forced to wait on a lock, it signals the lock_timeout thread
as well. This call is not required. In a heavily contended workload
each thread going to wait will signal the lock_timeout thread making
it work all the time. As lock_timeout thread scans the array of
waiting threads under lock_sys::wait_mutex which is already very
hot in contneded loads, these extra scans can cause significanct
performance regression.
Also, in various codepaths lock_timeout thread is signalled where
actual intention was to signal the innodb monitor thread.
Merged lp:maria/maria-10.0-galera up to revision 3879.
Added a new functions to handler API to forcefully abort_transaction,
producing fake_trx_id, get_checkpoint and set_checkpoint for XA. These
were added for future possiblity to add more storage engines that
could use galera replication.
buf0buf.cc line 2642.
Analysis: innodb_compression_algorithm is a global variable and
can change while we are building page compressed page. This could
lead page corruption.
Fix: Cache innodb_compression_algorithm on local variable before
doing any compression or page formating to avoid concurrent
change. Improved page verification on debug builds.
Analysis: InnoDB writes also files that do not contain FIL-header.
This could lead incorrect analysis on os_fil_read_func function
when it tries to see is page page compressed based on FIL_PAGE_TYPE
field on FIL-header. With bad luck uncompressed page that does
not contain FIL-headed, the byte on FIL_PAGE_TYPE position could
indicate that page is page comrpessed.
Fix: Upper layer must indicate is file space page compressed
or not. If this is not yet known, we need to read the FIL-header
and find it out. Files that we know that are not page compressed
we can always just provide FALSE.
IN RECOVERY
During redo log processing, the data dictionary is not available. We should
check it in dict_find_table_by_space() to prevent SEGV error.
rb#5678, approved by Jimmy.
IN RECOVERY
During redo log processing, the data dictionary is not available. We should
check it in dict_find_table_by_space() to prevent SEGV error.
rb#5678, approved by Jimmy.
4229: MDEV-5670: Assertion failure in file buf0lru.c line 2355
Add more status information if repeatable.
4230: MDEV-5673: Crash while parallel dropping multiple tables under heavy load
Improve long semaphore wait output to include all semaphore waits
and try to find out if there is a sequence of waiters.
4233: Fix compiler errors on product build.
4237: Fix too agressive long semaphore wait output and add guard against introducing
compression failures on insert buffer.
4238: Fix test failure caused by simulated compression failure on
IBUF_DUMMY table.
in file buf0mtflu.cc line 570.
Analysis: Real timing bug, we should take the mutex before we
try to send those shutdown messages, that would make sure
that threads doing a unfinished flush (they have acquired
this mutex) have time to do their work before we add shutdown
messages to work queue. Currently, we just add those shutdown
messages to work queue and code assumes that at flush, there
is constant number of items to be processed and thus
leading to assertion.
special builds UNIV_PAGECOMPRESS_DEBUG and UNIV_MTFLUSH_DEBUG).
Added a new status variable compress_pages_page_compression_error to count possible
compression errors.
Added option to disable multi-threaded flush with innodb_use_mtflush = 0
option, by default multi-threaded flush is used.
Updated innochecksum tool, still it does not support new checksums.
execution and global memory heap on shutdown.
Added a funcition to get work items from queue without waiting and
additional info when there is no work to do for a extended periods.
waiting for a job. They should sleep and let other threads to their work. At
shutdown, we know that we put "work" and that is handled as soon as possible.
Fixed issue on multi-threaded flush at shutdown.
Removed unnecessary startup option innodb_compress_pages.
Added a new startup option innodb_mtflush_threads, default 8.
Update InnoDB to 5.6.14
Apply MySQL-5.6 hack for MySQL Bug#16434374
Move Aria-only HA_RTREE_INDEX from my_base.h to maria_def.h (breaks an assert in InnoDB)
Fix InnoDB memory leak
SYNTAX: ATOMIC_WRITES=['DEFAULT','ON','OFF']
Idea here is to be able to define innodb_doublewrite = 1 but with following rules:
ATOMIC_WRITES='DEFAULT' - if innodb_use_atomic_writes = 1, we do not write to doublewrite buffer the changes
if innodb_use_atomic_writes = 0, we write to doublewrite buffer
ATOMIC_WRITES='ON' - do not write to doublewrite buffer
ATOMIC_WRITES='OFF' - write to doublewrite buffer
Note that doublewrite buffer can't be used if innodb_doublewrite = 0.
written and rest of the page is trimmed. In following writes
there is no need to trim again if write_size only increases
because rest of the page is already trimmed. If actual write
size decreases we need to trim again. Need to research if this
can happen frequently enough to make any effect.
After a clean shutdown, InnoDB will not check the *.ibd file headers,
for maximum performance. This is unchanged before and after this
patch.
What this fix addresses is the case when crash recovery is
needed. Previously, InnoDB could load a corrupted tablespace file.
buf_page_is_corrupted(): Add the parameter check_lsn.
fil_check_first_page(): New function, to perform a consistency check
on the first page of a file. This can be overridden by setting
innodb_force_recovery.
fil_read_first_page(), fil_open_single_table_tablespace(),
fil_load_single_table_tablespace(): Invoke fil_check_first_page().
open_or_create_data_files(): Check the status of
fil_open_single_table_tablespace().
rb#2352 approved by Jimmy Yang
After a clean shutdown, InnoDB will not check the *.ibd file headers,
for maximum performance. This is unchanged before and after this
patch.
What this fix addresses is the case when crash recovery is
needed. Previously, InnoDB could load a corrupted tablespace file.
buf_page_is_corrupted(): Add the parameter check_lsn.
fil_check_first_page(): New function, to perform a consistency check
on the first page of a file. This can be overridden by setting
innodb_force_recovery.
fil_read_first_page(), fil_open_single_table_tablespace(),
fil_load_single_table_tablespace(): Invoke fil_check_first_page().
open_or_create_data_files(): Check the status of
fil_open_single_table_tablespace().
rb#2352 approved by Jimmy Yang
I/O IS ASYNC
rb://1934
approved by: Mikael Ronstrom (over email)
When submitting AIO read request don't signal that the thread is
about to wait on DISKIO
I/O IS ASYNC
rb://1934
approved by: Mikael Ronstrom (over email)
When submitting AIO read request don't signal that the thread is
about to wait on DISKIO
Backporting the WL#5716, "Information schema table for InnoDB
buffer pool information". Backporting revisions 2876.244.113,
2876.244.102 from mysql-trunk.
rb://1177 approved by Jimmy Yang.
Backporting the WL#5716, "Information schema table for InnoDB
buffer pool information". Backporting revisions 2876.244.113,
2876.244.102 from mysql-trunk.
rb://1177 approved by Jimmy Yang.
Changes in the InnoDB codebase required to compile and
integrate the MEB codebase with MySQL 5.5.
@ storage/innobase/btr/btr0btr.c
Excluded buffer pool usage from MEB build.
buf_pool_from_bpage calls are in buf0buf.ic, and
the buffer pool functions from that file are
disabled in MEB.
@ storage/innobase/buf/buf0buf.c
Disabling more buffer pool functions unused in MEB.
@ storage/innobase/dict/dict0dict.c
Disabling dict_ind_free that is unused in MEB.
@ storage/innobase/dict/dict0mem.c
The include
#include "ha_prototypes.h"
Was causing conflicts with definitions in my_global.h
Linking C executable mysqlbackup
libinnodb.a(dict0mem.c.o): In function `dict_mem_foreign_table_name_lookup_set':
dict0mem.c:(.text+0x91c): undefined reference to `innobase_get_lower_case_table_names'
libinnodb.a(dict0mem.c.o): In function `dict_mem_referenced_table_name_lookup_set':
dict0mem.c:(.text+0x9fc): undefined reference to `innobase_get_lower_case_table_names'
libinnodb.a(dict0mem.c.o): In function `dict_mem_foreign_table_name_lookup_set':
dict0mem.c:(.text+0x96e): undefined reference to `innobase_casedn_str'
libinnodb.a(dict0mem.c.o): In function `dict_mem_referenced_table_name_lookup_set':
dict0mem.c:(.text+0xa4e): undefined reference to `innobase_casedn_str'
collect2: ld returned 1 exit status
make[2]: *** [mysqlbackup] Error 1
innobase_get_lower_case_table_names
innobase_casedn_str
are functions that are part of ha_innodb.cc that is not part of the build
dict_mem_foreign_table_name_lookup_set
function is not there in the current codebase, meaning we do not use it in MEB.
@ storage/innobase/fil/fil0fil.c
The srv_fast_shutdown variable is declared in
srv0srv.c that is not compiled in the
mysqlbackup codebase.
This throws an undeclared error.
From the Manual
---------------
innodb_fast_shutdown
--------------------
The InnoDB shutdown mode. The default value is 1
as of MySQL 3.23.50, which causes a “fast� shutdown
(the normal type of shutdown). If the value is 0,
InnoDB does a full purge and an insert buffer merge
before a shutdown. These operations can take minutes,
or even hours in extreme cases. If the value is 1,
InnoDB skips these operations at shutdown.
This ideally does not matter from mysqlbackup
@ storage/innobase/ha/ha0ha.c
In file included from /home/narayanan/mysql-server/mysql-5.5-meb-rel3.8-innodb-integration-1/storage/innobase/ha/ha0ha.c:34:0:
/home/narayanan/mysql-server/mysql-5.5-meb-rel3.8-innodb-integration-1/storage/innobase/include/btr0sea.h:286:17: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘*’ token
make[2]: *** [CMakeFiles/innodb.dir/home/narayanan/mysql-server/mysql-5.5-meb-rel3.8-innodb-integration-1/storage/innobase/ha/ha0ha.c.o] Error 1
make[1]: *** [CMakeFiles/innodb.dir/all] Error 2
make: *** [all] Error 2
# include "sync0rw.h" is excluded from hotbackup compilation in dict0dict.h
This causes extern rw_lock_t* btr_search_latch_temp; to throw a failure because
the definition of rw_lock_t is not found.
@ storage/innobase/include/buf0buf.h
Excluding buffer pool functions that are unused from the
MEB codebase.
@ storage/innobase/include/buf0buf.ic
replicated the exclusion of
#include "buf0flu.h"
#include "buf0lru.h"
#include "buf0rea.h"
by looking at the current codebase in <meb-trunk>/src/innodb
@ storage/innobase/include/dict0dict.h
dict_table_x_lock_indexes, dict_table_x_unlock_indexes, dict_table_is_corrupted,
dict_index_is_corrupted, buf_block_buf_fix_inc_func are unused in MEB and was
leading to compilation errors and hence excluded.
@ storage/innobase/include/dict0dict.ic
dict_table_x_lock_indexes, dict_table_x_unlock_indexes, dict_table_is_corrupted,
dict_index_is_corrupted, buf_block_buf_fix_inc_func are unused in MEB and was
leading to compilation errors and hence excluded.
@ storage/innobase/include/log0log.h
/home/narayanan/mysql-server/mysql-5.5-meb-rel3.8-innodb-integration-1/storage/innobase/include/log0log.h: At top level:
/home/narayanan/mysql-server/mysql-5.5-meb-rel3.8-innodb-integration-1/storage/innobase/include/log0log.h:767:2: error: expected specifier-qualifier-list before â⠂¬Ëœmutex_t’
mutex_t definitions were excluded as seen from ambient code
hence excluding definition for log_flush_order_mutex also.
@ storage/innobase/include/os0file.h
Bug in InnoDB code, create_mode should have been create.
@ storage/innobase/include/srv0srv.h
In file included from /home/narayanan/mysql-server/mysql-5.5-meb-rel3.8-innodb-integration-1/storage/innobase/buf/buf0buf.c:50:0:
/home/narayanan/mysql-server/mysql-5.5-meb-rel3.8-innodb-integration-1/storage/innobase/include/srv0srv.h: At top level:
/home/narayanan/mysql-server/mysql-5.5-meb-rel3.8-innodb-integration-1/storage/innobase/include/srv0srv.h:120:16: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘srv_use_native_aio’
srv_use_native_aio - we do not use native aio of the OS anyway from MEB. MEB does not compile
InnoDB with this option. Hence disabling it.
@ storage/innobase/include/trx0sys.h
[ 56%] Building C object CMakeFiles/innodb.dir/home/narayanan/mysql-server/mysql-5.5-meb-rel3.8-innodb-integration-1/storage/innobase/trx/trx0sys.c.o
/home/narayanan/mysql-server/mysql-5.5-meb-rel3.8-innodb-integration-1/storage/innobase/trx/trx0sys.c: In function ‘trx_sys_read_file_format_id’:
/home/narayanan/mysql-server/mysql-5.5-meb-rel3.8-innodb-integration-1/storage/innobase/trx/trx0sys.c:1499:20: error: ‘TRX_SYS_FILE_FORMAT_TAG_MAGIC_N’ undeclared (first use in this function)
/home/narayanan/mysql-server/mysql-5.5-meb-rel3.8-innodb-integration-1/storage/innobase/trx/trx0sys.c:1499:20: note: each undeclared identifier is reported only once for each function it appears in
/home/narayanan/mysql-server/mysql-5.5-meb-rel3.8-innodb-integration-1/storage/innobase/trx/trx0sys.c: At top level:
/home/narayanan/mysql-server/mysql-5.5-meb-rel3.8-innodb-integration-1/storage/innobase/include/buf0buf.h:607:1: warning: ‘buf_block_buf_fix_inc_func’ declared ‘static’ but never defined
make[2]: *** [CMakeFiles/innodb.dir/home/narayanan/mysql-server/mysql-5.5-meb-rel3.8-innodb-integration-1/storage/innobase/trx/trx0sys.c.o] Error 1
unused calls excluded to enable compilation
@ storage/innobase/mem/mem0dbg.c
excluding #include "ha_prototypes.h" that lead to definitions in ha_innodb.cc
@ storage/innobase/os/os0file.c
InnoDB not compiled with aio support from MEB anyway. Hence excluding this from
the compilation.
@ storage/innobase/page/page0zip.c
page0zip.c:(.text+0x4e9e): undefined reference to `buf_pool_from_block'
collect2: ld returned 1 exit status
buf_pool_from_block defined in buf0buf.ic, most of the file is excluded for compilation of MEB
@ storage/innobase/ut/ut0dbg.c
excluding #include "ha_prototypes.h" since it leads to definitions in ha_innodb.cc
innobase_basename(file) is defined in ha_innodb.cc. Hence excluding that also.
@ storage/innobase/ut/ut0ut.c
cal_tm unused from MEB, was leading to earnings, hence disabling for MEB.
Changes in the InnoDB codebase required to compile and
integrate the MEB codebase with MySQL 5.5.
@ storage/innobase/btr/btr0btr.c
Excluded buffer pool usage from MEB build.
buf_pool_from_bpage calls are in buf0buf.ic, and
the buffer pool functions from that file are
disabled in MEB.
@ storage/innobase/buf/buf0buf.c
Disabling more buffer pool functions unused in MEB.
@ storage/innobase/dict/dict0dict.c
Disabling dict_ind_free that is unused in MEB.
@ storage/innobase/dict/dict0mem.c
The include
#include "ha_prototypes.h"
Was causing conflicts with definitions in my_global.h
Linking C executable mysqlbackup
libinnodb.a(dict0mem.c.o): In function `dict_mem_foreign_table_name_lookup_set':
dict0mem.c:(.text+0x91c): undefined reference to `innobase_get_lower_case_table_names'
libinnodb.a(dict0mem.c.o): In function `dict_mem_referenced_table_name_lookup_set':
dict0mem.c:(.text+0x9fc): undefined reference to `innobase_get_lower_case_table_names'
libinnodb.a(dict0mem.c.o): In function `dict_mem_foreign_table_name_lookup_set':
dict0mem.c:(.text+0x96e): undefined reference to `innobase_casedn_str'
libinnodb.a(dict0mem.c.o): In function `dict_mem_referenced_table_name_lookup_set':
dict0mem.c:(.text+0xa4e): undefined reference to `innobase_casedn_str'
collect2: ld returned 1 exit status
make[2]: *** [mysqlbackup] Error 1
innobase_get_lower_case_table_names
innobase_casedn_str
are functions that are part of ha_innodb.cc that is not part of the build
dict_mem_foreign_table_name_lookup_set
function is not there in the current codebase, meaning we do not use it in MEB.
@ storage/innobase/fil/fil0fil.c
The srv_fast_shutdown variable is declared in
srv0srv.c that is not compiled in the
mysqlbackup codebase.
This throws an undeclared error.
From the Manual
---------------
innodb_fast_shutdown
--------------------
The InnoDB shutdown mode. The default value is 1
as of MySQL 3.23.50, which causes a “fast� shutdown
(the normal type of shutdown). If the value is 0,
InnoDB does a full purge and an insert buffer merge
before a shutdown. These operations can take minutes,
or even hours in extreme cases. If the value is 1,
InnoDB skips these operations at shutdown.
This ideally does not matter from mysqlbackup
@ storage/innobase/ha/ha0ha.c
In file included from /home/narayanan/mysql-server/mysql-5.5-meb-rel3.8-innodb-integration-1/storage/innobase/ha/ha0ha.c:34:0:
/home/narayanan/mysql-server/mysql-5.5-meb-rel3.8-innodb-integration-1/storage/innobase/include/btr0sea.h:286:17: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘*’ token
make[2]: *** [CMakeFiles/innodb.dir/home/narayanan/mysql-server/mysql-5.5-meb-rel3.8-innodb-integration-1/storage/innobase/ha/ha0ha.c.o] Error 1
make[1]: *** [CMakeFiles/innodb.dir/all] Error 2
make: *** [all] Error 2
# include "sync0rw.h" is excluded from hotbackup compilation in dict0dict.h
This causes extern rw_lock_t* btr_search_latch_temp; to throw a failure because
the definition of rw_lock_t is not found.
@ storage/innobase/include/buf0buf.h
Excluding buffer pool functions that are unused from the
MEB codebase.
@ storage/innobase/include/buf0buf.ic
replicated the exclusion of
#include "buf0flu.h"
#include "buf0lru.h"
#include "buf0rea.h"
by looking at the current codebase in <meb-trunk>/src/innodb
@ storage/innobase/include/dict0dict.h
dict_table_x_lock_indexes, dict_table_x_unlock_indexes, dict_table_is_corrupted,
dict_index_is_corrupted, buf_block_buf_fix_inc_func are unused in MEB and was
leading to compilation errors and hence excluded.
@ storage/innobase/include/dict0dict.ic
dict_table_x_lock_indexes, dict_table_x_unlock_indexes, dict_table_is_corrupted,
dict_index_is_corrupted, buf_block_buf_fix_inc_func are unused in MEB and was
leading to compilation errors and hence excluded.
@ storage/innobase/include/log0log.h
/home/narayanan/mysql-server/mysql-5.5-meb-rel3.8-innodb-integration-1/storage/innobase/include/log0log.h: At top level:
/home/narayanan/mysql-server/mysql-5.5-meb-rel3.8-innodb-integration-1/storage/innobase/include/log0log.h:767:2: error: expected specifier-qualifier-list before â⠂¬Ëœmutex_t’
mutex_t definitions were excluded as seen from ambient code
hence excluding definition for log_flush_order_mutex also.
@ storage/innobase/include/os0file.h
Bug in InnoDB code, create_mode should have been create.
@ storage/innobase/include/srv0srv.h
In file included from /home/narayanan/mysql-server/mysql-5.5-meb-rel3.8-innodb-integration-1/storage/innobase/buf/buf0buf.c:50:0:
/home/narayanan/mysql-server/mysql-5.5-meb-rel3.8-innodb-integration-1/storage/innobase/include/srv0srv.h: At top level:
/home/narayanan/mysql-server/mysql-5.5-meb-rel3.8-innodb-integration-1/storage/innobase/include/srv0srv.h:120:16: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘srv_use_native_aio’
srv_use_native_aio - we do not use native aio of the OS anyway from MEB. MEB does not compile
InnoDB with this option. Hence disabling it.
@ storage/innobase/include/trx0sys.h
[ 56%] Building C object CMakeFiles/innodb.dir/home/narayanan/mysql-server/mysql-5.5-meb-rel3.8-innodb-integration-1/storage/innobase/trx/trx0sys.c.o
/home/narayanan/mysql-server/mysql-5.5-meb-rel3.8-innodb-integration-1/storage/innobase/trx/trx0sys.c: In function ‘trx_sys_read_file_format_id’:
/home/narayanan/mysql-server/mysql-5.5-meb-rel3.8-innodb-integration-1/storage/innobase/trx/trx0sys.c:1499:20: error: ‘TRX_SYS_FILE_FORMAT_TAG_MAGIC_N’ undeclared (first use in this function)
/home/narayanan/mysql-server/mysql-5.5-meb-rel3.8-innodb-integration-1/storage/innobase/trx/trx0sys.c:1499:20: note: each undeclared identifier is reported only once for each function it appears in
/home/narayanan/mysql-server/mysql-5.5-meb-rel3.8-innodb-integration-1/storage/innobase/trx/trx0sys.c: At top level:
/home/narayanan/mysql-server/mysql-5.5-meb-rel3.8-innodb-integration-1/storage/innobase/include/buf0buf.h:607:1: warning: ‘buf_block_buf_fix_inc_func’ declared ‘static’ but never defined
make[2]: *** [CMakeFiles/innodb.dir/home/narayanan/mysql-server/mysql-5.5-meb-rel3.8-innodb-integration-1/storage/innobase/trx/trx0sys.c.o] Error 1
unused calls excluded to enable compilation
@ storage/innobase/mem/mem0dbg.c
excluding #include "ha_prototypes.h" that lead to definitions in ha_innodb.cc
@ storage/innobase/os/os0file.c
InnoDB not compiled with aio support from MEB anyway. Hence excluding this from
the compilation.
@ storage/innobase/page/page0zip.c
page0zip.c:(.text+0x4e9e): undefined reference to `buf_pool_from_block'
collect2: ld returned 1 exit status
buf_pool_from_block defined in buf0buf.ic, most of the file is excluded for compilation of MEB
@ storage/innobase/ut/ut0dbg.c
excluding #include "ha_prototypes.h" since it leads to definitions in ha_innodb.cc
innobase_basename(file) is defined in ha_innodb.cc. Hence excluding that also.
@ storage/innobase/ut/ut0ut.c
cal_tm unused from MEB, was leading to earnings, hence disabling for MEB.
rb://1043
approved by: Sunny Bains
Two internal counters were incremented twice for a single
operations. The counters are:
srv_buf_pool_flushed
buf_lru_flush_page_count
rb://1043
approved by: Sunny Bains
Two internal counters were incremented twice for a single
operations. The counters are:
srv_buf_pool_flushed
buf_lru_flush_page_count
rb://942
approved by: Marko Makela
We don't need to scan LRU for dropping AHI entries when DROPing a table.
AHI entries are already removed when we free up extents for the btree.
rb://942
approved by: Marko Makela
We don't need to scan LRU for dropping AHI entries when DROPing a table.
AHI entries are already removed when we free up extents for the btree.
print page dump
buf_page_print(): Remove the ut_ad(0) from the beginning. Add two flags
(enum buf_page_print_flags) that can be bitwise-ORed together:
BUF_PAGE_PRINT_NO_CRASH:
Do not crash debug builds at the end of buf_page_print().
BUF_PAGE_PRINT_NO_FULL:
Do not print the full page dump. This can be useful when adding
diagnostic printout to flushing or to the doublewrite buffer.
trx_sys_doublewrite_init_or_restore_page(): Replace exit(1) with ut_error,
so that we can get a core dump if this extraordinary condition happens.
rb:924 approved by Sunny Bains
print page dump
buf_page_print(): Remove the ut_ad(0) from the beginning. Add two flags
(enum buf_page_print_flags) that can be bitwise-ORed together:
BUF_PAGE_PRINT_NO_CRASH:
Do not crash debug builds at the end of buf_page_print().
BUF_PAGE_PRINT_NO_FULL:
Do not print the full page dump. This can be useful when adding
diagnostic printout to flushing or to the doublewrite buffer.
trx_sys_doublewrite_init_or_restore_page(): Replace exit(1) with ut_error,
so that we can get a core dump if this extraordinary condition happens.
rb:924 approved by Sunny Bains
This fix does not remove the underlying cause of the assertion
failure. It just works around the problem, allowing a corrupted
secondary index to be fixed by DROP INDEX and CREATE INDEX (or in the
worst case, by re-creating the table).
ibuf_delete(): If the record to be purged is the last one in the page
or it is not delete-marked, refuse to purge it. Instead, write an
error message to the error log and let a debug assertion fail.
ibuf_set_del_mark(): If the record to be delete-marked is not found,
display some more information in the error log and let a debug
assertion fail.
row_undo_mod_del_unmark_sec_and_undo_update(),
row_upd_sec_index_entry(): Let a debug assertion fail when the record
to be delete-marked is not found.
buf_page_print(): Add ut_ad(0) so that corruption will be more
prominent in stress testing with debug binaries. Add ut_ad(0) here and
there where corruption is noticed.
btr_corruption_report(): Display some data on page_is_comp() mismatch.
btr_assert_not_corrupted(): A wrapper around btr_corruption_report().
Assert that page_is_comp() agrees with the table flags.
rb:911 approved by Inaam Rana
This fix does not remove the underlying cause of the assertion
failure. It just works around the problem, allowing a corrupted
secondary index to be fixed by DROP INDEX and CREATE INDEX (or in the
worst case, by re-creating the table).
ibuf_delete(): If the record to be purged is the last one in the page
or it is not delete-marked, refuse to purge it. Instead, write an
error message to the error log and let a debug assertion fail.
ibuf_set_del_mark(): If the record to be delete-marked is not found,
display some more information in the error log and let a debug
assertion fail.
row_undo_mod_del_unmark_sec_and_undo_update(),
row_upd_sec_index_entry(): Let a debug assertion fail when the record
to be delete-marked is not found.
buf_page_print(): Add ut_ad(0) so that corruption will be more
prominent in stress testing with debug binaries. Add ut_ad(0) here and
there where corruption is noticed.
btr_corruption_report(): Display some data on page_is_comp() mismatch.
btr_assert_not_corrupted(): A wrapper around btr_corruption_report().
Assert that page_is_comp() agrees with the table flags.
rb:911 approved by Inaam Rana
InnoDB: Remove HAVE_purify, UNIV_INIT_MEM_TO_ZERO, UNIV_SET_MEM_TO_ZERO.
The compile-time setting HAVE_purify can mask potential bugs.
It is being set in PB2 Valgrind runs. We should simply get rid of it,
and replace it with UNIV_MEM_INVALID() to declare uninitialized memory
as such in Valgrind-instrumented binaries.
os_mem_alloc_large(), ut_malloc_low(): Remove the parameter set_to_zero.
ut_malloc(): Define as a macro that invokes ut_malloc_low().
buf_pool_init(): Never initialize the buffer pool frames. All pages
must be initialized before flushing them to disk.
mem_heap_alloc(): Never initialize the allocated memory block.
os_mem_alloc_nocache(), ut_test_malloc(): Unused function, remove.
rb:813 approved by Jimmy Yang
InnoDB: Remove HAVE_purify, UNIV_INIT_MEM_TO_ZERO, UNIV_SET_MEM_TO_ZERO.
The compile-time setting HAVE_purify can mask potential bugs.
It is being set in PB2 Valgrind runs. We should simply get rid of it,
and replace it with UNIV_MEM_INVALID() to declare uninitialized memory
as such in Valgrind-instrumented binaries.
os_mem_alloc_large(), ut_malloc_low(): Remove the parameter set_to_zero.
ut_malloc(): Define as a macro that invokes ut_malloc_low().
buf_pool_init(): Never initialize the buffer pool frames. All pages
must be initialized before flushing them to disk.
mem_heap_alloc(): Never initialize the allocated memory block.
os_mem_alloc_nocache(), ut_test_malloc(): Unused function, remove.
rb:813 approved by Jimmy Yang
WITH LARGE BUFFER POOL
(Note: this a backport of revno:3472 from mysql-trunk)
rb://845
approved by: Marko
When dropping a table (with an .ibd file i.e.: with
innodb_file_per_table set) we scan entire LRU to invalidate pages from
that table. This can be painful in case of large buffer pools as we hold
the buf_pool->mutex for the scan. Note that gravity of the problem does
not depend on the size of the table. Even with an empty table but a
large and filled up buffer pool we'll end up scanning a very long LRU
list.
The fix is to scan flush_list and just remove the blocks belonging to
the table from the flush_list, marking them as non-dirty. The blocks
are left in the LRU list for eventual eviction due to aging. The
flush_list is typically much smaller than the LRU list but for cases
where it is very long we have the solution of releasing the
buf_pool->mutex after scanning 1K pages.
buf_page_[set|unset]_sticky(): Use new IO-state BUF_IO_PIN to ensure
that a block stays in the flush_list and LRU list when we release
buf_pool->mutex. Previously we have been abusing BUF_IO_READ to achieve
this.
WITH LARGE BUFFER POOL
(Note: this a backport of revno:3472 from mysql-trunk)
rb://845
approved by: Marko
When dropping a table (with an .ibd file i.e.: with
innodb_file_per_table set) we scan entire LRU to invalidate pages from
that table. This can be painful in case of large buffer pools as we hold
the buf_pool->mutex for the scan. Note that gravity of the problem does
not depend on the size of the table. Even with an empty table but a
large and filled up buffer pool we'll end up scanning a very long LRU
list.
The fix is to scan flush_list and just remove the blocks belonging to
the table from the flush_list, marking them as non-dirty. The blocks
are left in the LRU list for eventual eviction due to aging. The
flush_list is typically much smaller than the LRU list but for cases
where it is very long we have the solution of releasing the
buf_pool->mutex after scanning 1K pages.
buf_page_[set|unset]_sticky(): Use new IO-state BUF_IO_PIN to ensure
that a block stays in the flush_list and LRU list when we release
buf_pool->mutex. Previously we have been abusing BUF_IO_READ to achieve
this.
The fix of Bug#12612184 broke crash recovery. When a record that
contains off-page columns (BLOBs) is updated, we must first write redo
log about the BLOB page writes, and only after that write the redo log
about the B-tree changes. The buggy fix would log the B-tree changes
first, meaning that after recovery, we could end up having a record
that contains a null BLOB pointer.
Because we will be redo logging the writes off the off-page columns
before the B-tree changes, we must make sure that the pages chosen for
the off-page columns are free both before and after the B-tree
changes. In this way, the worst thing that can happen in crash
recovery is that the BLOBs are written to free pages, but the B-tree
changes are not applied. The BLOB pages would correctly remain free in
this case. To achieve this, we must allocate the BLOB pages in the
mini-transaction of the B-tree operation. A further quirk is that BLOB
pages are allocated from the same file segment as leaf pages. Because
of this, we must temporarily "hide" any leaf pages that were freed
during the B-tree operation by "fake allocating" them prior to writing
the BLOBs, and freeing them again before the mtr_commit() of the
B-tree operation, in btr_mark_freed_leaves().
btr_cur_mtr_commit_and_start(): Remove this faulty function that was
introduced in the Bug#12612184 fix. The problem that this function was
trying to address was that when we did mtr_commit() the BLOB writes
before the mtr_commit() of the update, the new BLOB pages could have
overwritten clustered index B-tree leaf pages that were freed during
the update. If recovery applied the redo log of the BLOB writes but
did not see the log of the record update, the index tree would be
corrupted. The correct solution is to make the freed clustered index
pages unavailable to the BLOB allocation. This function is also a
likely culprit of InnoDB hangs that were observed when testing the
Bug#12612184 fix.
btr_mark_freed_leaves(): Mark all freed clustered index leaf pages of
a mini-transaction allocated (nonfree=TRUE) before storing the BLOBs,
or freed (nonfree=FALSE) before committing the mini-transaction.
btr_freed_leaves_validate(): A debug function for checking that all
clustered index leaf pages that have been marked free in the
mini-transaction are consistent (have not been zeroed out).
btr_page_alloc_low(): Refactored from btr_page_alloc(). Return the
number of the allocated page, or FIL_NULL if out of space. Add the
parameter "mtr_t* init_mtr" for specifying the mini-transaction where
the page should be initialized, or if this is a "fake allocation"
(init_mtr=NULL) by btr_mark_freed_leaves(nonfree=TRUE).
btr_page_alloc(): Add the parameter init_mtr, allowing the page to be
initialized and X-latched in a different mini-transaction than the one
that is used for the allocation. Invoke btr_page_alloc_low(). If a
clustered index leaf page was previously freed in mtr, remove it from
the memo of previously freed pages.
btr_page_free(): Assert that the page is a B-tree page and it has been
X-latched by the mini-transaction. If the freed page was a leaf page
of a clustered index, link it by a MTR_MEMO_FREE_CLUST_LEAF marker to
the mini-transaction.
btr_store_big_rec_extern_fields_func(): Add the parameter alloc_mtr,
which is NULL (old behaviour in inserts) and the same as local_mtr in
updates. If alloc_mtr!=NULL, the BLOB pages will be allocated from it
instead of the mini-transaction that is used for writing the BLOBs.
fsp_alloc_from_free_frag(): Refactored from
fsp_alloc_free_page(). Allocate the specified page from a partially
free extent.
fseg_alloc_free_page_low(), fseg_alloc_free_page_general(): Add the
parameter "mtr_t* init_mtr" for specifying the mini-transaction where
the page should be initialized, or NULL if this is a "fake allocation"
that prevents the reuse of a previously freed B-tree page for BLOB
storage. If init_mtr==NULL, try harder to reallocate the specified page
and assert that it succeeded.
fsp_alloc_free_page(): Add the parameter "mtr_t* init_mtr" for
specifying the mini-transaction where the page should be initialized.
Do not allow init_mtr == NULL, because this function is never to be
used for "fake allocations".
mtr_t: Add the operation MTR_MEMO_FREE_CLUST_LEAF and the flag
mtr->freed_clust_leaf for quickly determining if any
MTR_MEMO_FREE_CLUST_LEAF operations have been posted.
row_ins_index_entry_low(): When columns are being made off-page in
insert-by-update, invoke btr_mark_freed_leaves(nonfree=TRUE) and pass
the mini-transaction as the alloc_mtr to
btr_store_big_rec_extern_fields(). Finally, invoke
btr_mark_freed_leaves(nonfree=FALSE) to avoid leaking pages.
row_build(): Correct a comment, and add a debug assertion that a
record that contains NULL BLOB pointers must be a fresh insert.
row_upd_clust_rec(): When columns are being moved off-page, invoke
btr_mark_freed_leaves(nonfree=TRUE) and pass the mini-transaction as
the alloc_mtr to btr_store_big_rec_extern_fields(). Finally, invoke
btr_mark_freed_leaves(nonfree=FALSE) to avoid leaking pages.
buf_reset_check_index_page_at_flush(): Remove. The function
fsp_init_file_page_low() already sets
bpage->check_index_page_at_flush=FALSE.
There is a known issue in tablespace extension. If the request to
allocate a BLOB page leads to the tablespace being extended, crash
recovery could see BLOB writes to pages that are off the tablespace
file bounds. This should trigger an assertion failure in fil_io() at
crash recovery. The safe thing would be to write redo log about the
tablespace extension to the mini-transaction of the BLOB write, not to
the mini-transaction of the record update. However, there is no redo
log record for file extension in the current redo log format.
rb:693 approved by Sunny Bains
The fix of Bug#12612184 broke crash recovery. When a record that
contains off-page columns (BLOBs) is updated, we must first write redo
log about the BLOB page writes, and only after that write the redo log
about the B-tree changes. The buggy fix would log the B-tree changes
first, meaning that after recovery, we could end up having a record
that contains a null BLOB pointer.
Because we will be redo logging the writes off the off-page columns
before the B-tree changes, we must make sure that the pages chosen for
the off-page columns are free both before and after the B-tree
changes. In this way, the worst thing that can happen in crash
recovery is that the BLOBs are written to free pages, but the B-tree
changes are not applied. The BLOB pages would correctly remain free in
this case. To achieve this, we must allocate the BLOB pages in the
mini-transaction of the B-tree operation. A further quirk is that BLOB
pages are allocated from the same file segment as leaf pages. Because
of this, we must temporarily "hide" any leaf pages that were freed
during the B-tree operation by "fake allocating" them prior to writing
the BLOBs, and freeing them again before the mtr_commit() of the
B-tree operation, in btr_mark_freed_leaves().
btr_cur_mtr_commit_and_start(): Remove this faulty function that was
introduced in the Bug#12612184 fix. The problem that this function was
trying to address was that when we did mtr_commit() the BLOB writes
before the mtr_commit() of the update, the new BLOB pages could have
overwritten clustered index B-tree leaf pages that were freed during
the update. If recovery applied the redo log of the BLOB writes but
did not see the log of the record update, the index tree would be
corrupted. The correct solution is to make the freed clustered index
pages unavailable to the BLOB allocation. This function is also a
likely culprit of InnoDB hangs that were observed when testing the
Bug#12612184 fix.
btr_mark_freed_leaves(): Mark all freed clustered index leaf pages of
a mini-transaction allocated (nonfree=TRUE) before storing the BLOBs,
or freed (nonfree=FALSE) before committing the mini-transaction.
btr_freed_leaves_validate(): A debug function for checking that all
clustered index leaf pages that have been marked free in the
mini-transaction are consistent (have not been zeroed out).
btr_page_alloc_low(): Refactored from btr_page_alloc(). Return the
number of the allocated page, or FIL_NULL if out of space. Add the
parameter "mtr_t* init_mtr" for specifying the mini-transaction where
the page should be initialized, or if this is a "fake allocation"
(init_mtr=NULL) by btr_mark_freed_leaves(nonfree=TRUE).
btr_page_alloc(): Add the parameter init_mtr, allowing the page to be
initialized and X-latched in a different mini-transaction than the one
that is used for the allocation. Invoke btr_page_alloc_low(). If a
clustered index leaf page was previously freed in mtr, remove it from
the memo of previously freed pages.
btr_page_free(): Assert that the page is a B-tree page and it has been
X-latched by the mini-transaction. If the freed page was a leaf page
of a clustered index, link it by a MTR_MEMO_FREE_CLUST_LEAF marker to
the mini-transaction.
btr_store_big_rec_extern_fields_func(): Add the parameter alloc_mtr,
which is NULL (old behaviour in inserts) and the same as local_mtr in
updates. If alloc_mtr!=NULL, the BLOB pages will be allocated from it
instead of the mini-transaction that is used for writing the BLOBs.
fsp_alloc_from_free_frag(): Refactored from
fsp_alloc_free_page(). Allocate the specified page from a partially
free extent.
fseg_alloc_free_page_low(), fseg_alloc_free_page_general(): Add the
parameter "mtr_t* init_mtr" for specifying the mini-transaction where
the page should be initialized, or NULL if this is a "fake allocation"
that prevents the reuse of a previously freed B-tree page for BLOB
storage. If init_mtr==NULL, try harder to reallocate the specified page
and assert that it succeeded.
fsp_alloc_free_page(): Add the parameter "mtr_t* init_mtr" for
specifying the mini-transaction where the page should be initialized.
Do not allow init_mtr == NULL, because this function is never to be
used for "fake allocations".
mtr_t: Add the operation MTR_MEMO_FREE_CLUST_LEAF and the flag
mtr->freed_clust_leaf for quickly determining if any
MTR_MEMO_FREE_CLUST_LEAF operations have been posted.
row_ins_index_entry_low(): When columns are being made off-page in
insert-by-update, invoke btr_mark_freed_leaves(nonfree=TRUE) and pass
the mini-transaction as the alloc_mtr to
btr_store_big_rec_extern_fields(). Finally, invoke
btr_mark_freed_leaves(nonfree=FALSE) to avoid leaking pages.
row_build(): Correct a comment, and add a debug assertion that a
record that contains NULL BLOB pointers must be a fresh insert.
row_upd_clust_rec(): When columns are being moved off-page, invoke
btr_mark_freed_leaves(nonfree=TRUE) and pass the mini-transaction as
the alloc_mtr to btr_store_big_rec_extern_fields(). Finally, invoke
btr_mark_freed_leaves(nonfree=FALSE) to avoid leaking pages.
buf_reset_check_index_page_at_flush(): Remove. The function
fsp_init_file_page_low() already sets
bpage->check_index_page_at_flush=FALSE.
There is a known issue in tablespace extension. If the request to
allocate a BLOB page leads to the tablespace being extended, crash
recovery could see BLOB writes to pages that are off the tablespace
file bounds. This should trigger an assertion failure in fil_io() at
crash recovery. The safe thing would be to write redo log about the
tablespace extension to the mini-transaction of the BLOB write, not to
the mini-transaction of the record update. However, there is no redo
log record for file extension in the current redo log format.
rb:693 approved by Sunny Bains
Also addressed issues in bug #11745133, where we could mark a table
corrupted instead of crashing the server when found a corrupted buffer/page
if the table created with innodb_file_per_table on.
Also addressed issues in bug #11745133, where we could mark a table
corrupted instead of crashing the server when found a corrupted buffer/page
if the table created with innodb_file_per_table on.
------------------------------------------------------------
revno 2876.244.305
revision id marko.makela@oracle.com-20110413082211-e6ouhjz5rmqxcqap
parent marko.makela@oracle.com-20110413075948-kvytmc37ye1nt7d9
committer Marko Mäkelä <marko.makela@oracle.com>
branch nick 5.6-innodb
timestamp Wed 2011-04-13 11:22:11 +0300
message:
Suppress the Bug #58815 (Bug #11765812) assertion failure.
buf_page_get_gen(): Introduce BUF_GET_POSSIBLY_FREED for suppressing the
check that the file page must not have been freed.
btr_estimate_n_rows_in_range_on_level(): Pass BUF_GET_POSSIBLY_FREED and
explain in the comments why this is needed and why it should be mostly
harmless to ignore the problem. If InnoDB had always initialized all
unused fields in data files, no problem would exist.
This change does not fix the bug, it just "shoots the messenger".
rb:647 approved by Jimmy Yang
------------------------------------------------------------
revno 2876.244.305
revision id marko.makela@oracle.com-20110413082211-e6ouhjz5rmqxcqap
parent marko.makela@oracle.com-20110413075948-kvytmc37ye1nt7d9
committer Marko Mäkelä <marko.makela@oracle.com>
branch nick 5.6-innodb
timestamp Wed 2011-04-13 11:22:11 +0300
message:
Suppress the Bug #58815 (Bug #11765812) assertion failure.
buf_page_get_gen(): Introduce BUF_GET_POSSIBLY_FREED for suppressing the
check that the file page must not have been freed.
btr_estimate_n_rows_in_range_on_level(): Pass BUF_GET_POSSIBLY_FREED and
explain in the comments why this is needed and why it should be mostly
harmless to ignore the problem. If InnoDB had always initialized all
unused fields in data files, no problem would exist.
This change does not fix the bug, it just "shoots the messenger".
rb:647 approved by Jimmy Yang
Remove most references to thread id in InnoDB. Three references
remain: the current holder of a mutex, and the current x-lock holder
of a rw-lock, and some references in UNIV_SYNC_DEBUG checks. This
allows MySQL to change the thread associated to a client connection.
Tighten the UNIV_SYNC_DEBUG checks, trying to ensure that no InnoDB
mutex or x-lock is being held when returning control to MySQL. The
only semaphore that may be held is the btr_search_latch in shared mode.
sync_thread_levels_empty_except_dict(): A wrapper for
sync_thread_levels_empty_gen(TRUE).
sync_thread_levels_nonempty_trx(): Check that the current thread is
not holding any InnoDB semaphores, except btr_search_latch if
trx->has_search_latch.
sync_thread_levels_empty(): Unused function; remove.
trx_t: Remove mysql_thread_id and mysql_process_no.
srv_slot_t: Remove id and handle.
row_search_for_mysql(), srv_conc_enter_innodb(),
srv_conc_force_enter_innodb(), srv_conc_force_exit_innodb(),
srv_conc_exit_innodb(), srv_suspend_mysql_thread: Assert
!sync_thread_levels_nonempty_trx().
rb:634 approved by Sunny Bains
Remove most references to thread id in InnoDB. Three references
remain: the current holder of a mutex, and the current x-lock holder
of a rw-lock, and some references in UNIV_SYNC_DEBUG checks. This
allows MySQL to change the thread associated to a client connection.
Tighten the UNIV_SYNC_DEBUG checks, trying to ensure that no InnoDB
mutex or x-lock is being held when returning control to MySQL. The
only semaphore that may be held is the btr_search_latch in shared mode.
sync_thread_levels_empty_except_dict(): A wrapper for
sync_thread_levels_empty_gen(TRUE).
sync_thread_levels_nonempty_trx(): Check that the current thread is
not holding any InnoDB semaphores, except btr_search_latch if
trx->has_search_latch.
sync_thread_levels_empty(): Unused function; remove.
trx_t: Remove mysql_thread_id and mysql_process_no.
srv_slot_t: Remove id and handle.
row_search_for_mysql(), srv_conc_enter_innodb(),
srv_conc_force_enter_innodb(), srv_conc_force_exit_innodb(),
srv_conc_exit_innodb(), srv_suspend_mysql_thread: Assert
!sync_thread_levels_nonempty_trx().
rb:634 approved by Sunny Bains
ibuf_inside(), ibuf_enter(), ibuf_exit(): Add the parameter mtr. The
flag is no longer kept in the thread-local storage but in the
mini-transaction (mtr->inside_ibuf).
mtr_start(): Clean up the comment and remove the unused return value.
mtr_commit(): Assert !ibuf_inside(mtr) in debug builds.
ibuf_mtr_start(): Like mtr_start(), but sets the flag.
ibuf_mtr_commit(), ibuf_btr_pcur_commit_specify_mtr(): Wrappers that
assert ibuf_inside().
buf_page_get_zip(), buf_page_init_for_read(),
buf_read_ibuf_merge_pages(), fil_io(), ibuf_free_excess_pages(),
ibuf_contract_ext(): Remove assertions on ibuf_inside(), because a
mini-transaction is not available.
buf_read_ahead_linear(): Add the parameter inside_ibuf.
ibuf_restore_pos(): When this function returns FALSE, it commits mtr
and must therefore do ibuf_exit(mtr).
ibuf_delete_rec(): This function commits mtr and must therefore do
ibuf_exit(mtr).
ibuf_rec_get_page_no(), ibuf_rec_get_space(), ibuf_rec_get_info(),
ibuf_rec_get_op_type(), ibuf_build_entry_from_ibuf_rec(),
ibuf_rec_get_volume(), ibuf_get_merge_page_nos(),
ibuf_get_volume_buffered_count(), ibuf_get_entry_counter_low(): Add
the parameter mtr in debug builds, for asserting ibuf_inside(mtr).
rb:585 approved by Sunny Bains
ibuf_inside(), ibuf_enter(), ibuf_exit(): Add the parameter mtr. The
flag is no longer kept in the thread-local storage but in the
mini-transaction (mtr->inside_ibuf).
mtr_start(): Clean up the comment and remove the unused return value.
mtr_commit(): Assert !ibuf_inside(mtr) in debug builds.
ibuf_mtr_start(): Like mtr_start(), but sets the flag.
ibuf_mtr_commit(), ibuf_btr_pcur_commit_specify_mtr(): Wrappers that
assert ibuf_inside().
buf_page_get_zip(), buf_page_init_for_read(),
buf_read_ibuf_merge_pages(), fil_io(), ibuf_free_excess_pages(),
ibuf_contract_ext(): Remove assertions on ibuf_inside(), because a
mini-transaction is not available.
buf_read_ahead_linear(): Add the parameter inside_ibuf.
ibuf_restore_pos(): When this function returns FALSE, it commits mtr
and must therefore do ibuf_exit(mtr).
ibuf_delete_rec(): This function commits mtr and must therefore do
ibuf_exit(mtr).
ibuf_rec_get_page_no(), ibuf_rec_get_space(), ibuf_rec_get_info(),
ibuf_rec_get_op_type(), ibuf_build_entry_from_ibuf_rec(),
ibuf_rec_get_volume(), ibuf_get_merge_page_nos(),
ibuf_get_volume_buffered_count(), ibuf_get_entry_counter_low(): Add
the parameter mtr in debug builds, for asserting ibuf_inside(mtr).
rb:585 approved by Sunny Bains