mirror of
https://github.com/MariaDB/server.git
synced 2025-01-17 04:22:27 +01:00
6bbca54d7d
Unit test for recovery: runs ma_test1 and ma_test2 (both only with INSERTs and DELETEs; UPDATEs disabled as not handled by recovery), then moves the tables elsewhere, recreates the tables from the log, compares them, and fails if there is a difference. Passes now.
Most of maria_read_log.c moved to ma_recovery.c, as it will be re-used for recovery-from-ha_maria.
Bugfixes of applying of REDO_INSERT, REDO_PURGE_ROW.
Applying of REDO_PURGE_BLOCKS, REDO_DELETE_ALL, REDO_DROP_TABLE, UNDO_ROW_INSERT (in REDO phase only, i.e. just doing records++), UNDO_ROW_DELETE, UNDO_ROW_PURGE.
Code cleanups.
Monty: please look for "QQ". Sanja: please look for "Sanja".
Future tasks: recovery of the bitmap (easy), recovery of the state (make it idempotent), more REDOs (Monty to work on REDO_UPDATE?), UNDO phase...
Pushing this cset as it looks safe, contains test and bugfixes which will help Monty implement applying of REDO_UPDATE.

sql/handler.cc:
  typo
storage/maria/Makefile.am:
  Adding ma_test_recovery (which ma_test_all invokes, and which can also be run alone). Most of maria_read_log.c moved to ma_recovery.c
storage/maria/ha_maria.cc:
  comments
storage/maria/ma_bitmap.c:
  fixing comments. 2 -> sizeof(maria_bitmap_marker).
  Bitmap-related part of _ma_initialize_datafile() moves in bitmap module.
  Now putting the "bm" signature when creating the first bitmap page (it used to happen only at next open, but that caused an annoying difference when testing Recovery if the original run didn't open the table, and it looks more logical like this: it goes to disk only with its signature correct; see the "QQ" comment towards the _ma_initialize_data_file() call in ma_create.c for more).
  When reading a bitmap page, verify its signature (happens when normally using the table or when CHECKing it; not when REPAIRing it).
storage/maria/ma_blockrec.c:
  * no need to sync the data file if table is not transactional
  * Comments, code cleanup (log-related data moved to log-related code block, int5store->page_store).
  * Store the table's short id into LOGREC_UNDO_ROW_PURGE, like we do for other records (though this record will soon be replaced with a CLR).
  * If "page" is 1 it means the page which extends from byte page*block_size+1 to (page+1)*block_size (byte number 1 being the first byte of the file). The last byte of the file is data_file_length (same convention). A new page needs to be created if the last byte of the page is beyond the last byte of the file, i.e. (page+1)*block_size+1 > data_file_length, so we correct the test (bug found when testing log applying for ma_test1 -M -T --skip-update).
  * update the page's LSN when removing a row from it during execution of a REDO_PURGE_ROW record (bug found when testing log applying for ma_test1 -M -T --skip-update).
  * applying of REDO_PURGE_BLOCKs (limited to a one-page range for now).
storage/maria/ma_blockrec.h:
  new functions. maria_bitmap_marker does not need to be exported.
storage/maria/ma_close.c:
  we can always flush the table's state when closing the last instance of the table. And it is needed for maria_read_log (as it does not use maria_lock_database()).
storage/maria/ma_control_file.c:
  when in Recovery, some assertions should not be used.
storage/maria/ma_control_file.h:
  double-inclusion safe
storage/maria/ma_create.c:
  during recovery, don't log records. Comments. Moving the creation of the first bitmap page to ma_bitmap.c
storage/maria/ma_delete_table.c:
  during recovery, don't log records. Log the end-zero of the dropped table's name, so that recovery can use the string in place without extending it to fit an end zero.
storage/maria/ma_loghandler.c:
  * inwrite_rec_hook also needs access to the MARIA_SHARE, like prewrite_rec_hook. This will be needed to update share->records_diff (in the upcoming patch "recovery of the state").
  * LOG_DESC::record_ends_group changed to an enum.
  * LOG_DESC for LOGREC_REDO_PURGE_BLOCKS and LOGREC_UNDO_ROW_PURGE corrected
  * Sanja please see the @todo LOG BUG
  * avoiding DBUG_RETURN(func()) as it gives confusing debug traces.
storage/maria/ma_loghandler.h:
  - log write hooks called while the log's lock is held (inwrite_rec_hook) now need the MARIA_SHARE, like prewrite_rec_hook already had
  - instead of a bool saying if this record's type ends groups or not, we refine: it may not end a group, it may end a group, or it may be a group in itself. Imagine that we had a physical write failure to a table before we log the UNDO, we still end up in external_lock(F_UNLCK) and then we log a COMMIT: we don't want to consider this COMMIT as ending the group of REDOs (don't want to execute those REDOs during Recovery), that's why we say "COMMIT is a group in itself, it aborts any previous group". This also gives one more sanity check in maria_read_log.
storage/maria/ma_recovery.c:
  New Recovery code, replacing the old pseudocode. Most of maria_read_log moved here. Call-able from ha_maria, but not enabled yet.
  Compared to the previous version of maria_read_log, some bugs have been fixed, debugging output can go to stdout or a disk file (for now it's useful for me, later it can be changed), execution of REDO_DROP_TABLE, REDO_DELETE_ALL, REDO_PURGE_BLOCKS has been added. Duplicate code has been factored into functions.
  We abort an unfinished group of records if we see a record which is a group in itself (like COMMIT).
  No need for maria_panic() after a bug (which caused tables to not be closed) was fixed; if there is yet another bug I prefer to see it.
  When opening a table for Recovery, set data_file_length and key_file_length to their real physical value (these are the easiest state members to restore :). Warn us if the last page was truncated (but Recovery handles it).
  MARIA_SHARE::state::state::records is now partly recovered (not idempotent, but works if recreating tables from scratch).
  When applying a REDO to a page, stamp it with the UNDO's LSN (current_group_end_lsn), not with the REDO's LSN; it makes the table more identical to the original table (easier to compare the two tables in the end).
  Big thing missing: some types of REDOs are not handled, and the UNDO phase does not exist (missing functions to execute UNDOs to actually rollback). So for now tests are only inserting/deleting a few 100 rows, closing the table and seeing if the log is applied ok; it works. UPDATE not handled.
storage/maria/ma_recovery.h:
  new functions: ma_recover() for recovery from inside ha_maria; _ma_apply_log() for maria_read_log (ma_recover() calls _ma_apply_log()). Btw, we need to not use the word "recover" for REPAIR/maria_chk anymore.
storage/maria/ma_rename.c:
  don't write log records during recovery
storage/maria/ma_test2.c:
  - fail if maria_info() or other subtests find some wrong information
  - new option -g to skip updates.
  - init the translog before creating the table, so that log applying can work.
  - in "#if 0" you'll see some fixed bugs (will be removed).
storage/maria/ma_test_all.sh:
  cleanup files. Test log applying.
storage/maria/maria_read_log.c:
  most of the logic moves to ma_recovery.c to be shared between maria_read_log and recovery-from-inside-mysqld. See ma_recovery.c for additional changes made to the moved code.
storage/maria/ma_test_recovery:
  unit test for Recovery. Tests insert and delete, REDO_UPDATE not yet coded. Script is called from ma_test_all. Can run standalone.
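The three-way classification of record types described for ma_loghandler.h (does not end a group, ends a group, is a group in itself) can be illustrated with a small sketch. Every name below (group_end_kind, scan_result, scan_log) is hypothetical and is not taken from the Maria sources. The sketch assumes only what the message states: a record that is a group in itself aborts any unfinished previous group. Counting such a record as executed, and reporting leftover records at the end of the log as pending, are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical three-way classification of a log record type. */
enum group_end_kind
{
  REC_NOT_GROUP_END,    /* belongs to a group but does not close it */
  REC_ENDS_GROUP,       /* closes the group accumulated so far */
  REC_IS_GROUP_ITSELF   /* stands alone; aborts any unfinished group */
};

/* Hypothetical summary of one pass over the log. */
struct scan_result
{
  size_t executed;  /* records belonging to a properly closed group */
  size_t aborted;   /* records dropped because their group was aborted */
  size_t pending;   /* records still waiting for a group end at log end */
};

/* Walk the record kinds in log order and classify each record. */
struct scan_result scan_log(const enum group_end_kind *recs, size_t n)
{
  struct scan_result res= {0, 0, 0};
  size_t i;

  assert(recs != NULL || n == 0);
  for (i= 0; i < n; i++)
  {
    switch (recs[i])
    {
    case REC_NOT_GROUP_END:
      res.pending++;                      /* wait for the group's end */
      break;
    case REC_ENDS_GROUP:
      res.executed+= res.pending + 1;     /* whole group can be applied */
      res.pending= 0;
      break;
    case REC_IS_GROUP_ITSELF:
      res.aborted+= res.pending;          /* unfinished group is dropped */
      res.pending= 0;
      res.executed++;                     /* assumption: record itself counts */
      break;
    }
  }
  return res;
}
```

With the sequence REDO, REDO, COMMIT-like record, REDO, group end, REDO, this sketch reports two aborted records, three executed records and one pending record.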
320 lines
11 KiB
C
/* Copyright (C) 2006 MySQL AB & MySQL Finland AB & TCX DataKonsult AB

   This program is free software; you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation; version 2 of the License.

   This program is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU General Public License for more details.

   You should have received a copy of the GNU General Public License
   along with this program; if not, write to the Free Software
   Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA */

/*
  WL#3234 Maria control file
  First version written by Guilhem Bichot on 2006-04-27.
  Does not compile yet.
*/

#include "maria_def.h"

/* Here is the implementation of this module */

/*
  a control file contains 3 objects: magic string, LSN of last checkpoint,
  number of last log.
*/

/* total size should be < sector size for atomic write operation */
#define CONTROL_FILE_MAGIC_STRING "\xfe\xfe\xc\1MACF"
#define CONTROL_FILE_MAGIC_STRING_OFFSET 0
#define CONTROL_FILE_MAGIC_STRING_SIZE (sizeof(CONTROL_FILE_MAGIC_STRING)-1)
#define CONTROL_FILE_CHECKSUM_OFFSET (CONTROL_FILE_MAGIC_STRING_OFFSET + CONTROL_FILE_MAGIC_STRING_SIZE)
#define CONTROL_FILE_CHECKSUM_SIZE 4
#define CONTROL_FILE_LSN_OFFSET (CONTROL_FILE_CHECKSUM_OFFSET + CONTROL_FILE_CHECKSUM_SIZE)
#define CONTROL_FILE_LSN_SIZE LSN_STORE_SIZE
#define CONTROL_FILE_FILENO_OFFSET (CONTROL_FILE_LSN_OFFSET + CONTROL_FILE_LSN_SIZE)
#define CONTROL_FILE_FILENO_SIZE 4
#define CONTROL_FILE_SIZE (CONTROL_FILE_FILENO_OFFSET + CONTROL_FILE_FILENO_SIZE)

/* This module owns these two vars. */
LSN last_checkpoint_lsn= LSN_IMPOSSIBLE;
uint32 last_logno= FILENO_IMPOSSIBLE;

/**
   @brief If log's lock should be asserted when writing to control file.

   Can be re-used by any function which needs to be thread-safe except when
   it is called at startup.
*/
my_bool maria_multi_threaded= FALSE;
/** @brief if currently doing a recovery */
my_bool maria_in_recovery= FALSE;

/*
  Control file is less than 512 bytes (a disk sector),
  to be as atomic as possible
*/
static int control_file_fd= -1;

/*
  @brief Initialize control file subsystem

  Looks for the control file. If none and creation is requested, creates file.
  If present, reads it to find out last checkpoint's LSN and last log, updates
  the last_checkpoint_lsn and last_logno global variables.
  Called at engine's start.

  @param create_if_missing  create the control file if it does not exist

  @note
    The format of the control file is (sizes follow the #defines above):
    8 bytes: magic string
    4 bytes: checksum of the following bytes
    3 bytes: number of log where last checkpoint is
    4 bytes: offset in log where last checkpoint is
    4 bytes: number of last log

  @return Operation status
    @retval 0      OK
    @retval !=0    Error, as a CONTROL_FILE_ERROR code (in which case the
                   file is left closed)
*/
CONTROL_FILE_ERROR ma_control_file_create_or_open(my_bool create_if_missing)
{
  char buffer[CONTROL_FILE_SIZE];
  char name[FN_REFLEN];
  MY_STAT stat_buff;
  my_bool create_file;
  int open_flags= O_BINARY | /*O_DIRECT |*/ O_RDWR;
  int error= CONTROL_FILE_UNKNOWN_ERROR;
  DBUG_ENTER("ma_control_file_create_or_open");

  /*
    If you change sizes in the #defines, you at least have to change the
    "*store" and "*korr" calls in this file, and can even create backward
    compatibility problems. Beware!
  */
  DBUG_ASSERT(CONTROL_FILE_LSN_SIZE == (3+4));
  DBUG_ASSERT(CONTROL_FILE_FILENO_SIZE == 4);

  if (control_file_fd >= 0) /* already open */
    DBUG_RETURN(0);

  if (fn_format(name, CONTROL_FILE_BASE_NAME,
                maria_data_root, "", MYF(MY_WME)) == NullS)
    DBUG_RETURN(CONTROL_FILE_UNKNOWN_ERROR);

  create_file= test(my_access(name,F_OK));

  if (create_file)
  {
    if (!create_if_missing)
      DBUG_RETURN(CONTROL_FILE_MISSING);
    if ((control_file_fd= my_create(name, 0,
                                    open_flags, MYF(MY_SYNC_DIR))) < 0)
      DBUG_RETURN(CONTROL_FILE_UNKNOWN_ERROR);

    /*
      To be safer we should make sure that there are no logs or data/index
      files around (indeed it could be that the control file alone was deleted
      or not restored, and we should not go on with life at this point).

      TODO: For now we trust (this is alpha version), but for beta it would
      be great to verify.

      We could have a tool which can rebuild the control file, by reading the
      directory of logs, finding the newest log, reading it to find last
      checkpoint... Slow but can save your db. For this to be possible, we
      must always write to the control file right after writing the checkpoint
      log record, and do nothing in between (i.e. the checkpoint must be
      usable as soon as it has been written to the log).
    */

    /* init the file with these "undefined" values */
    DBUG_RETURN(ma_control_file_write_and_force(LSN_IMPOSSIBLE,
                                                FILENO_IMPOSSIBLE,
                                                CONTROL_FILE_UPDATE_ALL));
  }

  /* Otherwise, file exists */

  if ((control_file_fd= my_open(name, open_flags, MYF(MY_WME))) < 0)
    goto err;

  if (my_stat(name, &stat_buff, MYF(MY_WME)) == NULL)
    goto err;

  if ((uint)stat_buff.st_size < CONTROL_FILE_SIZE)
  {
    /*
      Given that normally we write only a sector and it's atomic, the only
      possibility for a file to be of too short size is if we crashed at the
      very first startup, between file creation and file write. Quite unlikely
      (and can be made even more unlikely by doing this: create a temp file,
      write it, and then rename it to be the control file).
      What's more likely is if someone forgot to restore the control file,
      just did a "touch control" to try to get Maria to start, or if the
      disk/filesystem has a problem.
      So let's be rigid.
    */
    /*
      TODO: store a message "too small file" somewhere, so that it goes to
      MySQL's error log at startup.
    */
    error= CONTROL_FILE_TOO_SMALL;
    goto err;
  }

  if ((uint)stat_buff.st_size > CONTROL_FILE_SIZE)
  {
    /* TODO: store "too big file" message */
    error= CONTROL_FILE_TOO_BIG;
    goto err;
  }

  if (my_read(control_file_fd, buffer, CONTROL_FILE_SIZE,
              MYF(MY_FNABP | MY_WME)))
    goto err;
  if (memcmp(buffer + CONTROL_FILE_MAGIC_STRING_OFFSET,
             CONTROL_FILE_MAGIC_STRING, CONTROL_FILE_MAGIC_STRING_SIZE))
  {
    /* TODO: store message "bad magic string" somewhere */
    error= CONTROL_FILE_BAD_MAGIC_STRING;
    goto err;
  }
  if (my_checksum(0, buffer + CONTROL_FILE_LSN_OFFSET,
                  CONTROL_FILE_SIZE - CONTROL_FILE_LSN_OFFSET) !=
      uint4korr(buffer + CONTROL_FILE_CHECKSUM_OFFSET))
  {
    /* TODO: store message "checksum mismatch" somewhere */
    error= CONTROL_FILE_BAD_CHECKSUM;
    goto err;
  }
  last_checkpoint_lsn= lsn_korr(buffer + CONTROL_FILE_LSN_OFFSET);
  last_logno= uint4korr(buffer + CONTROL_FILE_FILENO_OFFSET);

  DBUG_RETURN(0);
err:
  ma_control_file_end();
  DBUG_RETURN(error);
}


/*
  Write information durably to the control file; stores this information into
  the last_checkpoint_lsn and last_logno global variables.
  Called when we have created a new log (after syncing this log's creation)
  and when we have written a checkpoint (after syncing this log record).
  Variables last_checkpoint_lsn and last_logno must be protected by caller
  using log's lock, unless this function is called at startup.

  SYNOPSIS
    ma_control_file_write_and_force()
    checkpoint_lsn       LSN of last checkpoint
    logno                last log file number
    objs_to_write        which of the arguments should be used as new values
                         (for example, CONTROL_FILE_UPDATE_ONLY_LSN will not
                         write the logno argument to the control file and will
                         not update the last_logno global variable); can be:
                         CONTROL_FILE_UPDATE_ALL
                         CONTROL_FILE_UPDATE_ONLY_LSN
                         CONTROL_FILE_UPDATE_ONLY_LOGNO.

  NOTE
    We always want to do one single my_pwrite() here to be as atomic as
    possible.

  RETURN
    0 - OK
    1 - Error
*/

int ma_control_file_write_and_force(const LSN checkpoint_lsn, uint32 logno,
                                    uint objs_to_write)
{
  char buffer[CONTROL_FILE_SIZE];
  my_bool update_checkpoint_lsn= FALSE, update_logno= FALSE;
  DBUG_ENTER("ma_control_file_write_and_force");

  DBUG_ASSERT(control_file_fd >= 0); /* must be open */
#ifndef DBUG_OFF
  if (maria_multi_threaded)
    translog_lock_assert_owner();
#endif

  memcpy(buffer + CONTROL_FILE_MAGIC_STRING_OFFSET,
         CONTROL_FILE_MAGIC_STRING, CONTROL_FILE_MAGIC_STRING_SIZE);

  if (objs_to_write == CONTROL_FILE_UPDATE_ONLY_LSN)
    update_checkpoint_lsn= TRUE;
  else if (objs_to_write == CONTROL_FILE_UPDATE_ONLY_LOGNO)
    update_logno= TRUE;
  else if (objs_to_write == CONTROL_FILE_UPDATE_ALL)
    update_checkpoint_lsn= update_logno= TRUE;
  else /* incorrect value of objs_to_write */
    DBUG_ASSERT(0);

  if (update_checkpoint_lsn)
    lsn_store(buffer + CONTROL_FILE_LSN_OFFSET, checkpoint_lsn);
  else /* store old value == change nothing */
    lsn_store(buffer + CONTROL_FILE_LSN_OFFSET, last_checkpoint_lsn);

  if (update_logno)
    int4store(buffer + CONTROL_FILE_FILENO_OFFSET, logno);
  else
    int4store(buffer + CONTROL_FILE_FILENO_OFFSET, last_logno);

  {
    uint32 sum= (uint32)
      my_checksum(0, buffer + CONTROL_FILE_LSN_OFFSET,
                  CONTROL_FILE_SIZE - CONTROL_FILE_LSN_OFFSET);
    int4store(buffer + CONTROL_FILE_CHECKSUM_OFFSET, sum);
  }

  if (my_pwrite(control_file_fd, buffer, sizeof(buffer),
                0, MYF(MY_FNABP | MY_WME)) ||
      my_sync(control_file_fd, MYF(MY_WME)))
    DBUG_RETURN(1);

  if (update_checkpoint_lsn)
    last_checkpoint_lsn= checkpoint_lsn;
  if (update_logno)
    last_logno= logno;

  DBUG_RETURN(0);
}


/*
  Free resources taken by control file subsystem

  SYNOPSIS
    ma_control_file_end()
*/

int ma_control_file_end()
{
  int close_error;
  DBUG_ENTER("ma_control_file_end");

  if (control_file_fd < 0) /* already closed */
    DBUG_RETURN(0);

  close_error= my_close(control_file_fd, MYF(MY_WME));
  /*
    As my_close() frees structures even if close() fails, we do the same,
    i.e. we mark the file as closed in all cases.
  */
  control_file_fd= -1;
  /*
    As this module owns these variables, closing the module forbids access to
    them (just a safety):
  */
  last_checkpoint_lsn= LSN_IMPOSSIBLE;
  last_logno= FILENO_IMPOSSIBLE;

  DBUG_RETURN(close_error);
}
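The layout arithmetic and the validation order used above (size, then magic string, then checksum) can be tried outside the server with the following minimal sketch. It is not the module's code: the names build_image, check_image, store4, load4 and toy_checksum are hypothetical, toy_checksum is a simple stand-in for my_checksum() and yields different values, the 7-byte LSN is treated as opaque bytes (7 being the size asserted in ma_control_file_create_or_open()), and the low-byte-first 4-byte helpers are an assumption about how int4store()/uint4korr() lay out bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same layout as the CONTROL_FILE_* defines; names here are hypothetical. */
#define MAGIC        "\xfe\xfe\xc\1MACF"
#define MAGIC_SIZE   (sizeof(MAGIC) - 1)           /* 8 bytes */
#define SUM_OFFSET   MAGIC_SIZE
#define SUM_SIZE     4
#define LSN_OFFSET   (SUM_OFFSET + SUM_SIZE)
#define LSN_SIZE     7                              /* assumed LSN_STORE_SIZE */
#define LOGNO_OFFSET (LSN_OFFSET + LSN_SIZE)
#define LOGNO_SIZE   4
#define IMAGE_SIZE   (LOGNO_OFFSET + LOGNO_SIZE)    /* 23 bytes in total */

enum image_status
{
  IMAGE_OK, IMAGE_BAD_SIZE, IMAGE_BAD_MAGIC, IMAGE_BAD_CHECKSUM
};

/* Store/load a 32-bit value low byte first (assumed byte order). */
static void store4(unsigned char *p, uint32_t v)
{
  p[0]= (unsigned char) v;
  p[1]= (unsigned char) (v >> 8);
  p[2]= (unsigned char) (v >> 16);
  p[3]= (unsigned char) (v >> 24);
}

static uint32_t load4(const unsigned char *p)
{
  return (uint32_t) p[0] | ((uint32_t) p[1] << 8) |
         ((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
}

/* Toy checksum: a stand-in for my_checksum(), NOT the same algorithm. */
static uint32_t toy_checksum(const unsigned char *p, size_t len)
{
  uint32_t sum= 0;
  size_t i;
  for (i= 0; i < len; i++)
    sum= sum * 31 + p[i];
  return sum;
}

/* Fill the whole image in memory so that it can go out in one write. */
void build_image(unsigned char *image, const unsigned char *lsn,
                 uint32_t logno)
{
  assert(image != NULL && lsn != NULL);
  memcpy(image, MAGIC, MAGIC_SIZE);
  memcpy(image + LSN_OFFSET, lsn, LSN_SIZE);
  store4(image + LOGNO_OFFSET, logno);
  /* The checksum covers everything after the checksum field itself. */
  store4(image + SUM_OFFSET,
         toy_checksum(image + LSN_OFFSET, IMAGE_SIZE - LSN_OFFSET));
}

/* Validate in the same order as the open path: size, magic, checksum. */
enum image_status check_image(const unsigned char *image, size_t size,
                              uint32_t *logno)
{
  assert(image != NULL && logno != NULL);
  if (size != IMAGE_SIZE)
    return IMAGE_BAD_SIZE;
  if (memcmp(image, MAGIC, MAGIC_SIZE))
    return IMAGE_BAD_MAGIC;
  if (toy_checksum(image + LSN_OFFSET, IMAGE_SIZE - LSN_OFFSET) !=
      load4(image + SUM_OFFSET))
    return IMAGE_BAD_CHECKSUM;
  *logno= load4(image + LOGNO_OFFSET);
  return IMAGE_OK;
}
```

Building the complete image before writing mirrors the single my_pwrite() the NOTE in ma_control_file_write_and_force() asks for: because the image is smaller than a disk sector, one write is intended to replace the old contents as atomically as possible.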