mariadb/storage/maria/ma_control_file.c
unknown 6bbca54d7d WL#3072 - Maria recovery
Unit test for recovery: runs ma_test1 and ma_test2 (both only with
INSERTs and DELETEs; UPDATEs disabled as not handled by recovery)
then moves the tables elsewhere; recreates tables from the log, and
compares and fails if there is a difference. Passes now.
Most of maria_read_log.c moved to ma_recovery.c, as it will be re-used
for recovery-from-ha_maria.
Bugfixes of applying of REDO_INSERT, REDO_PURGE_ROW.
Applying of REDO_PURGE_BLOCKS, REDO_DELETE_ALL, REDO_DROP_TABLE,
UNDO_ROW_INSERT (in REDO phase only, i.e. just doing records++),
UNDO_ROW_DELETE, UNDO_ROW_PURGE.
Code cleanups.
Monty: please look for "QQ". Sanja: please look for "Sanja".
Future tasks: recovery of the bitmap (easy), recovery of the state
(make it idempotent), more REDOs (Monty to work on
REDO_UPDATE?), UNDO phase...
Pushing this cset as it looks safe, contains test and bugfixes which
will help Monty implement applying of REDO_UPDATE.


sql/handler.cc:
  typo
storage/maria/Makefile.am:
  Adding ma_test_recovery (which ma_test_all invokes, and which can
  also be run alone). Most of maria_read_log.c moved to ma_recovery.c
storage/maria/ha_maria.cc:
  comments
storage/maria/ma_bitmap.c:
  fixing comments. 2 -> sizeof(maria_bitmap_marker).
  Bitmap-related part of _ma_initialize_data_file() moves into the bitmap
  module.
  Now putting the "bm" signature when creating the first bitmap page
  (it used to happen only at next open, but that
  caused an annoying difference when testing Recovery if the original
  run didn't open the table, and it looks more
  logical like this: it goes to disk only with its signature correct;
  see the "QQ" comment towards the _ma_initialize_data_file() call
  in ma_create.c for more).
  When reading a bitmap page, verify its signature (happens when normally
  using the table or when CHECKing it; not when REPAIRing it).
storage/maria/ma_blockrec.c:
  * no need to sync the data file if table is not transactional
  * Comments, code cleanup (log-related data moved to log-related code
  block, int5store->page_store).
  * Store the table's short id into LOGREC_UNDO_ROW_PURGE, like we
  do for other records (though this record will soon be replaced
  with a CLR).
  * If "page" is 1 it means the page which extends from byte
  page*block_size+1 to (page+1)*block_size (byte number 1 being
  the first byte of the file). The last byte of the file is
  data_file_length (same convention).
  A new page needs to be created if the last byte of the page is
  beyond the last byte of the file, i.e.
   (page+1)*block_size+1 > data_file_length, so we correct the test
  (bug found when testing log applying for ma_test1 -M -T --skip-update).
  * update the page's LSN when removing a row from it during
  execution of a REDO_PURGE_ROW record (bug found when testing log
  applying for ma_test1 -M -T --skip-update).
  * applying of REDO_PURGE_BLOCKs (limited to a one-page range for now).
storage/maria/ma_blockrec.h:
  new functions. maria_bitmap_marker does not need to be exported.
storage/maria/ma_close.c:
  we can always flush the table's state when closing the last instance
  of the table. And it is needed for maria_read_log (as it does
  not use maria_lock_database()).
storage/maria/ma_control_file.c:
  when in Recovery, some assertions should not be used.
storage/maria/ma_control_file.h:
  double-inclusion safe
storage/maria/ma_create.c:
  during recovery, don't log records. Comments.
  Moving the creation of the first bitmap page to ma_bitmap.c
storage/maria/ma_delete_table.c:
  during recovery, don't log records. Log the end-zero of the dropped
  table's name, so that recovery can use the string in place without
  extending it to fit an end zero.
storage/maria/ma_loghandler.c:
  * inwrite_rec_hook also needs access to the MARIA_SHARE, like
  prewrite_rec_hook. This will be needed to update
  share->records_diff (in the upcoming patch "recovery of the state").
  * LOG_DESC::record_ends_group changed to an enum.
  * LOG_DESC for LOGREC_REDO_PURGE_BLOCKS and LOGREC_UNDO_ROW_PURGE
  corrected
  * Sanja please see the @todo LOG BUG
  * avoiding DBUG_RETURN(func()) as it gives confusing debug traces.
storage/maria/ma_loghandler.h:
  - log write hooks called while the log's lock is held (inwrite_rec_hook)
  now need the MARIA_SHARE, like prewrite_rec_hook already had
  - instead of a bool saying if this record's type ends groups or not,
  we refine: it may not end a group, it may end a group, or it may
  be a group in itself. Imagine that we had a physical write failure
  to a table before we log the UNDO, we still end up in
  external_lock(F_UNLCK) and then we log a COMMIT: we don't want
  to consider this COMMIT as ending the group of REDOs (don't want
  to execute those REDOs during Recovery), that's why we say "COMMIT
  is a group in itself, it aborts any previous group". This also
  gives one more sanity check in maria_read_log.
storage/maria/ma_recovery.c:
  New Recovery code, replacing the old pseudocode.
  Most of maria_read_log moved here.
  Call-able from ha_maria, but not enabled yet.
  Compared to the previous version of maria_read_log, some bugs have
  been fixed, debugging output can go to stdout or a disk file (for now
  it's useful for me, later it can be changed), execution of
  REDO_DROP_TABLE, REDO_DELETE_ALL, REDO_PURGE_BLOCKS has been added. Duplicate code
  has been factored into functions. We abort an unfinished group
  of records if we see a record which is a group in itself (like COMMIT).
  No need for maria_panic() after a bug (which caused tables to not
  be closed) was fixed; if there is yet another bug I prefer to see it.
  When opening a table for Recovery, set data_file_length
  and key_file_length to their real physical value (these are the
  easiest state members to restore :). Warn us if the last page
  was truncated (but Recovery handles it).
  MARIA_SHARE::state::state::records is now partly recovered (not
  idempotent, but works if recreating tables from scratch).
  When applying a REDO to a page, stamp it with the UNDO's LSN
  (current_group_end_lsn), not with the REDO's LSN; it makes
  the table more identical to the original table (easier to compare
  the two tables in the end).
  Big thing missing: some types of REDOs are not handled,
  and the UNDO phase does not exist (missing functions to execute UNDOs
  to actually rollback). So for now tests are only inserting/deleting
  a few 100 rows, closing the table and seeing if the log is applied ok;
  it works. UPDATE not handled.
storage/maria/ma_recovery.h:
  new functions: ma_recover() for recovery from inside ha_maria;
  _ma_apply_log() for maria_read_log (ma_recover() calls _ma_apply_log()).
  Btw, we need to not use the word "recover" for REPAIR/maria_chk anymore.
storage/maria/ma_rename.c:
  don't write log records during recovery
storage/maria/ma_test2.c:
  - fail if maria_info() or other subtests find some wrong information
  - new option -g to skip updates.
  - init the translog before creating the table, so that log applying
  can work.
  - in "#if 0" you'll see some fixed bugs (will be removed).
storage/maria/ma_test_all.sh:
  cleanup files. Test log applying.
storage/maria/maria_read_log.c:
  most of the logic moves to ma_recovery.c to be shared between
  maria_read_log and recovery-from-inside-mysqld.
  See ma_recovery.c for additional changes made to the moved code.
storage/maria/ma_test_recovery:
  unit test for Recovery. Tests insert and delete,
  REDO_UPDATE not yet coded.
  Script is called from ma_test_all. Can run standalone.
2007-07-26 11:56:21 +02:00
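
The three-way record classification described in the ma_loghandler.h note of the commit message (a record does not end a group, ends a group, or is a group in itself) can be illustrated with a small standalone sketch. All names here (group_role, scan_records, scan_result) are hypothetical and do not come from the Maria sources; the sketch only counts records, assuming that a "group in itself" record discards any pending, unfinished group and is then handled alone.

```c
#include <stddef.h>

/* Hypothetical classification, mirroring the three cases in the message. */
enum group_role
{
  REC_NOT_GROUP_END,   /* e.g. a REDO: waits for the record ending its group */
  REC_ENDS_GROUP,      /* e.g. an UNDO: completes the pending group */
  REC_IS_GROUP_ITSELF  /* e.g. a COMMIT: aborts any pending group */
};

struct scan_result
{
  unsigned applied;  /* records in complete groups, or standalone records */
  unsigned aborted;  /* records of groups dropped before completion */
  unsigned pending;  /* records still waiting for a group end at log end */
};

/* Walk the record roles in log order and classify what would be executed. */
static struct scan_result scan_records(const enum group_role *roles, size_t n)
{
  struct scan_result r= {0, 0, 0};
  size_t i;
  for (i= 0; i < n; i++)
  {
    switch (roles[i])
    {
    case REC_NOT_GROUP_END:
      r.pending++;                 /* keep it until the group is complete */
      break;
    case REC_ENDS_GROUP:
      r.applied+= r.pending + 1;   /* whole group, including its last record */
      r.pending= 0;
      break;
    case REC_IS_GROUP_ITSELF:
      r.aborted+= r.pending;       /* unfinished group is not executed */
      r.pending= 0;
      r.applied++;                 /* the record itself is handled alone */
      break;
    }
  }
  return r;
}
```

With the sequence REDO, REDO, UNDO, REDO, REDO, COMMIT, the first three records form a complete group, while the two REDOs before the COMMIT are dropped, which is the failed-write scenario the message describes.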

320 lines
11 KiB
C

/* Copyright (C) 2006 MySQL AB & MySQL Finland AB & TCX DataKonsult AB
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; version 2 of the License.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */
/*
WL#3234 Maria control file
First version written by Guilhem Bichot on 2006-04-27.
Does not compile yet.
*/
#include "maria_def.h"
/* Here is the implementation of this module */
/*
a control file contains 3 objects: magic string, LSN of last checkpoint,
number of last log.
*/
/* total size should be < sector size for atomic write operation */
#define CONTROL_FILE_MAGIC_STRING "\xfe\xfe\xc\1MACF"
#define CONTROL_FILE_MAGIC_STRING_OFFSET 0
#define CONTROL_FILE_MAGIC_STRING_SIZE (sizeof(CONTROL_FILE_MAGIC_STRING)-1)
#define CONTROL_FILE_CHECKSUM_OFFSET (CONTROL_FILE_MAGIC_STRING_OFFSET + CONTROL_FILE_MAGIC_STRING_SIZE)
#define CONTROL_FILE_CHECKSUM_SIZE 4
#define CONTROL_FILE_LSN_OFFSET (CONTROL_FILE_CHECKSUM_OFFSET + CONTROL_FILE_CHECKSUM_SIZE)
#define CONTROL_FILE_LSN_SIZE LSN_STORE_SIZE
#define CONTROL_FILE_FILENO_OFFSET (CONTROL_FILE_LSN_OFFSET + CONTROL_FILE_LSN_SIZE)
#define CONTROL_FILE_FILENO_SIZE 4
#define CONTROL_FILE_SIZE (CONTROL_FILE_FILENO_OFFSET + CONTROL_FILE_FILENO_SIZE)
/* This module owns these two vars. */
LSN last_checkpoint_lsn= LSN_IMPOSSIBLE;
uint32 last_logno= FILENO_IMPOSSIBLE;
/**
@brief If log's lock should be asserted when writing to control file.
Can be re-used by any function which needs to be thread-safe except when
it is called at startup.
*/
my_bool maria_multi_threaded= FALSE;
/** @brief if currently doing a recovery */
my_bool maria_in_recovery= FALSE;
/*
  Control file is less than 512 bytes (a disk sector),
  to be as atomic as possible
*/
static int control_file_fd= -1;
/**
  @brief Initialize control file subsystem

  Looks for the control file. If none and creation is requested, creates file.
  If present, reads it to find out last checkpoint's LSN and last log, updates
  the last_checkpoint_lsn and last_logno global variables.
  Called at engine's start.

  @param  create_if_missing  create the control file if it does not exist

  @note
    The format of the control file is (sizes follow the #defines above):
    8 bytes: magic string
    4 bytes: checksum of the following bytes
    7 bytes: LSN of last checkpoint (3 bytes: number of log where last
             checkpoint is; 4 bytes: offset in log where last checkpoint is)
    4 bytes: number of last log

  @return Operation status
    @retval 0        OK
    @retval nonzero  Error (in which case the file is left closed)
*/
CONTROL_FILE_ERROR ma_control_file_create_or_open(my_bool create_if_missing)
{
  char buffer[CONTROL_FILE_SIZE];
  char name[FN_REFLEN];
  MY_STAT stat_buff;
  my_bool create_file;
  int open_flags= O_BINARY | /*O_DIRECT |*/ O_RDWR;
  int error= CONTROL_FILE_UNKNOWN_ERROR;
  DBUG_ENTER("ma_control_file_create_or_open");

  /*
    If you change sizes in the #defines, you at least have to change the
    "*store" and "*korr" calls in this file, and can even create backward
    compatibility problems. Beware!
  */
  DBUG_ASSERT(CONTROL_FILE_LSN_SIZE == (3+4));
  DBUG_ASSERT(CONTROL_FILE_FILENO_SIZE == 4);

  if (control_file_fd >= 0) /* already open */
    DBUG_RETURN(0);

  if (fn_format(name, CONTROL_FILE_BASE_NAME,
                maria_data_root, "", MYF(MY_WME)) == NullS)
    DBUG_RETURN(CONTROL_FILE_UNKNOWN_ERROR);

  create_file= test(my_access(name,F_OK));

  if (create_file)
  {
    if (!create_if_missing)
      DBUG_RETURN(CONTROL_FILE_MISSING);
    if ((control_file_fd= my_create(name, 0,
                                    open_flags, MYF(MY_SYNC_DIR))) < 0)
      DBUG_RETURN(CONTROL_FILE_UNKNOWN_ERROR);

    /*
      To be safer we should make sure that there are no logs or data/index
      files around (indeed it could be that the control file alone was deleted
      or not restored, and we should not go on with life at this point).

      TODO: For now we trust (this is alpha version), but for beta it would
      be great to verify.

      We could have a tool which can rebuild the control file, by reading the
      directory of logs, finding the newest log, reading it to find last
      checkpoint... Slow but can save your db. For this to be possible, we
      must always write to the control file right after writing the checkpoint
      log record, and do nothing in between (i.e. the checkpoint must be
      usable as soon as it has been written to the log).
    */

    /* init the file with these "undefined" values */
    DBUG_RETURN(ma_control_file_write_and_force(LSN_IMPOSSIBLE,
                                                FILENO_IMPOSSIBLE,
                                                CONTROL_FILE_UPDATE_ALL));
  }

  /* Otherwise, file exists */

  if ((control_file_fd= my_open(name, open_flags, MYF(MY_WME))) < 0)
    goto err;

  if (my_stat(name, &stat_buff, MYF(MY_WME)) == NULL)
    goto err;

  if ((uint)stat_buff.st_size < CONTROL_FILE_SIZE)
  {
    /*
      Given that normally we write only a sector and it's atomic, the only
      possibility for a file to be of too short size is if we crashed at the
      very first startup, between file creation and file write. Quite unlikely
      (and can be made even more unlikely by doing this: create a temp file,
      write it, and then rename it to be the control file).
      What's more likely is if someone forgot to restore the control file,
      just did a "touch control" to try to get Maria to start, or if the
      disk/filesystem has a problem.
      So let's be rigid.
    */
    /*
      TODO: store a message "too small file" somewhere, so that it goes to
      MySQL's error log at startup.
    */
    error= CONTROL_FILE_TOO_SMALL;
    goto err;
  }

  if ((uint)stat_buff.st_size > CONTROL_FILE_SIZE)
  {
    /* TODO: store "too big file" message */
    error= CONTROL_FILE_TOO_BIG;
    goto err;
  }

  if (my_read(control_file_fd, buffer, CONTROL_FILE_SIZE,
              MYF(MY_FNABP | MY_WME)))
    goto err;

  if (memcmp(buffer + CONTROL_FILE_MAGIC_STRING_OFFSET,
             CONTROL_FILE_MAGIC_STRING, CONTROL_FILE_MAGIC_STRING_SIZE))
  {
    /* TODO: store message "bad magic string" somewhere */
    error= CONTROL_FILE_BAD_MAGIC_STRING;
    goto err;
  }

  if (my_checksum(0, buffer + CONTROL_FILE_LSN_OFFSET,
                  CONTROL_FILE_SIZE - CONTROL_FILE_LSN_OFFSET) !=
      uint4korr(buffer + CONTROL_FILE_CHECKSUM_OFFSET))
  {
    /* TODO: store message "checksum mismatch" somewhere */
    error= CONTROL_FILE_BAD_CHECKSUM;
    goto err;
  }

  last_checkpoint_lsn= lsn_korr(buffer + CONTROL_FILE_LSN_OFFSET);
  last_logno= uint4korr(buffer + CONTROL_FILE_FILENO_OFFSET);

  DBUG_RETURN(0);

err:
  ma_control_file_end();
  DBUG_RETURN(error);
}
/*
  Write information durably to the control file; stores this information into
  the last_checkpoint_lsn and last_logno global variables.
  Called when we have created a new log (after syncing this log's creation)
  and when we have written a checkpoint (after syncing this log record).
  Variables last_checkpoint_lsn and last_logno must be protected by caller
  using log's lock, unless this function is called at startup.

  SYNOPSIS
    ma_control_file_write_and_force()
    checkpoint_lsn       LSN of last checkpoint
    logno                last log file number
    objs_to_write        which of the arguments should be used as new values
                         (for example, CONTROL_FILE_UPDATE_ONLY_LSN will not
                         write the logno argument to the control file and will
                         not update the last_logno global variable); can be:
                         CONTROL_FILE_UPDATE_ALL
                         CONTROL_FILE_UPDATE_ONLY_LSN
                         CONTROL_FILE_UPDATE_ONLY_LOGNO.

  NOTE
    We always want to do one single my_pwrite() here to be as atomic as
    possible.

  RETURN
    0 - OK
    1 - Error
*/
int ma_control_file_write_and_force(const LSN checkpoint_lsn, uint32 logno,
                                    uint objs_to_write)
{
  char buffer[CONTROL_FILE_SIZE];
  my_bool update_checkpoint_lsn= FALSE, update_logno= FALSE;
  DBUG_ENTER("ma_control_file_write_and_force");

  DBUG_ASSERT(control_file_fd >= 0); /* must be open */
#ifndef DBUG_OFF
  if (maria_multi_threaded)
    translog_lock_assert_owner();
#endif

  memcpy(buffer + CONTROL_FILE_MAGIC_STRING_OFFSET,
         CONTROL_FILE_MAGIC_STRING, CONTROL_FILE_MAGIC_STRING_SIZE);

  if (objs_to_write == CONTROL_FILE_UPDATE_ONLY_LSN)
    update_checkpoint_lsn= TRUE;
  else if (objs_to_write == CONTROL_FILE_UPDATE_ONLY_LOGNO)
    update_logno= TRUE;
  else if (objs_to_write == CONTROL_FILE_UPDATE_ALL)
    update_checkpoint_lsn= update_logno= TRUE;
  else /* incorrect value of objs_to_write */
    DBUG_ASSERT(0);

  if (update_checkpoint_lsn)
    lsn_store(buffer + CONTROL_FILE_LSN_OFFSET, checkpoint_lsn);
  else /* store old value == change nothing */
    lsn_store(buffer + CONTROL_FILE_LSN_OFFSET, last_checkpoint_lsn);

  if (update_logno)
    int4store(buffer + CONTROL_FILE_FILENO_OFFSET, logno);
  else
    int4store(buffer + CONTROL_FILE_FILENO_OFFSET, last_logno);

  {
    uint32 sum= (uint32)
      my_checksum(0, buffer + CONTROL_FILE_LSN_OFFSET,
                  CONTROL_FILE_SIZE - CONTROL_FILE_LSN_OFFSET);
    int4store(buffer + CONTROL_FILE_CHECKSUM_OFFSET, sum);
  }

  if (my_pwrite(control_file_fd, buffer, sizeof(buffer),
                0, MYF(MY_FNABP | MY_WME)) ||
      my_sync(control_file_fd, MYF(MY_WME)))
    DBUG_RETURN(1);

  if (update_checkpoint_lsn)
    last_checkpoint_lsn= checkpoint_lsn;
  if (update_logno)
    last_logno= logno;

  DBUG_RETURN(0);
}
/*
  Free resources taken by control file subsystem

  SYNOPSIS
    ma_control_file_end()
*/
int ma_control_file_end()
{
  int close_error;
  DBUG_ENTER("ma_control_file_end");

  if (control_file_fd < 0) /* already closed */
    DBUG_RETURN(0);

  close_error= my_close(control_file_fd, MYF(MY_WME));
  /*
    As my_close() frees structures even if close() fails, we do the same,
    i.e. we mark the file as closed in all cases.
  */
  control_file_fd= -1;
  /*
    As this module owns these variables, closing the module forbids access to
    them (just a safety):
  */
  last_checkpoint_lsn= LSN_IMPOSSIBLE;
  last_logno= FILENO_IMPOSSIBLE;

  DBUG_RETURN(close_error);
}
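
The on-disk layout implied by the #defines in this file (8-byte magic string, 4-byte checksum, 7-byte LSN, 4-byte log number, 23 bytes in total) can be tried out with a standalone sketch. Everything below is an illustration under stated assumptions, not the module's code: the cf_* names are hypothetical, the checksum is an Adler-32-style stand-in for my_checksum(), and the 7-byte LSN is simply stored as a little-endian integer, which may differ from what lsn_store() does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offsets and sizes mirroring the #defines above (hypothetical names). */
enum
{
  CF_MAGIC_OFFSET= 0,
  CF_MAGIC_SIZE= 8,
  CF_SUM_OFFSET= CF_MAGIC_OFFSET + CF_MAGIC_SIZE,
  CF_SUM_SIZE= 4,
  CF_LSN_OFFSET= CF_SUM_OFFSET + CF_SUM_SIZE,
  CF_LSN_SIZE= 7,
  CF_FILENO_OFFSET= CF_LSN_OFFSET + CF_LSN_SIZE,
  CF_FILENO_SIZE= 4,
  CF_SIZE= CF_FILENO_OFFSET + CF_FILENO_SIZE
};

enum cf_status { CF_OK= 0, CF_BAD_MAGIC= 1, CF_BAD_CHECKSUM= 2 };

static const unsigned char cf_magic[CF_MAGIC_SIZE]=
  { 0xfe, 0xfe, 0x0c, 0x01, 'M', 'A', 'C', 'F' };

/* Stand-in checksum (Adler-32 style); the real module calls my_checksum(). */
static uint32_t toy_checksum(const unsigned char *p, size_t len)
{
  uint32_t a= 1, b= 0;
  size_t i;
  for (i= 0; i < len; i++)
  {
    a= (a + p[i]) % 65521;
    b= (b + a) % 65521;
  }
  return (b << 16) | a;
}

/* Store/load the low n bytes of v in little-endian order (an assumption). */
static void store_le(unsigned char *p, uint64_t v, size_t n)
{
  size_t i;
  for (i= 0; i < n; i++)
    p[i]= (unsigned char) (v >> (8 * i));
}

static uint64_t load_le(const unsigned char *p, size_t n)
{
  uint64_t v= 0;
  size_t i;
  for (i= 0; i < n; i++)
    v|= (uint64_t) p[i] << (8 * i);
  return v;
}

/* Fill a whole CF_SIZE buffer so that it can go out in one single write. */
static void cf_pack(unsigned char *buf, uint64_t lsn, uint32_t fileno)
{
  assert(buf != NULL);
  memcpy(buf + CF_MAGIC_OFFSET, cf_magic, CF_MAGIC_SIZE);
  store_le(buf + CF_LSN_OFFSET, lsn, CF_LSN_SIZE);
  store_le(buf + CF_FILENO_OFFSET, fileno, CF_FILENO_SIZE);
  /* The checksum covers everything after the checksum field itself. */
  store_le(buf + CF_SUM_OFFSET,
           toy_checksum(buf + CF_LSN_OFFSET, CF_SIZE - CF_LSN_OFFSET),
           CF_SUM_SIZE);
}

/* Validate magic string, then checksum, then extract the two values. */
static enum cf_status cf_unpack(const unsigned char *buf,
                                uint64_t *lsn, uint32_t *fileno)
{
  assert(buf != NULL && lsn != NULL && fileno != NULL);
  if (memcmp(buf + CF_MAGIC_OFFSET, cf_magic, CF_MAGIC_SIZE))
    return CF_BAD_MAGIC;
  if (toy_checksum(buf + CF_LSN_OFFSET, CF_SIZE - CF_LSN_OFFSET) !=
      (uint32_t) load_le(buf + CF_SUM_OFFSET, CF_SUM_SIZE))
    return CF_BAD_CHECKSUM;
  *lsn= load_le(buf + CF_LSN_OFFSET, CF_LSN_SIZE);
  *fileno= (uint32_t) load_le(buf + CF_FILENO_OFFSET, CF_FILENO_SIZE);
  return CF_OK;
}
```

Packing the full buffer before writing matches the module's stated goal of issuing one single write of less than a sector; the validation order (magic first, checksum second) follows ma_control_file_create_or_open().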