mirror of
https://github.com/MariaDB/server.git
synced 2025-01-29 02:05:57 +01:00
WL#3071 Maria checkpoint
Fixing bad comments (I remember my maths' teacher "one late night you'll obey to the simplifications made by your tired neurons"; exactly what happened here). In Checkpoint, when we flush a table's state we must flush all log records (WAL), not only those before checkpoint started. storage/maria/ma_bitmap.c: there was a flaw in reasoning, bug does exist. storage/maria/ma_blockrec.c: moving piece of comment to ma_checkpoint.c storage/maria/ma_checkpoint.c: Comments. When checkpoint flushes a state, WAL imposes that all records up to this state have been flushed, not only up to checkpoint_start_log_horizon. storage/maria/ma_recovery.c: finishing comment.
This commit is contained in:
parent
c2084d2a13
commit
086b34c935
4 changed files with 71 additions and 60 deletions
|
@ -533,16 +533,25 @@ static my_bool _ma_read_bitmap_page(MARIA_SHARE *share,
|
|||
Inexistent or half-created page (could be crash in the middle of
|
||||
_ma_bitmap_create_first(), before appending maria_bitmap_marker).
|
||||
*/
|
||||
/*
|
||||
We are updating data_file_length before writing any log record for the
|
||||
row operation. What if now state is flushed by a checkpoint with the
|
||||
new value, and crash before the checkpoint record is written, recovery
|
||||
may not even open the table (no log records) so not fix
|
||||
data_file_length ("WAL violation")? In fact this is ok:
|
||||
- checkpoint flushes state only if share->id!=0
|
||||
- so if state was flushed, table had share->id!=0, so had a
|
||||
LOGREC_FILE_ID (or was in previous checkpoint record), so recovery will
|
||||
meet and open it and fix data_file_length.
|
||||
/**
|
||||
@todo RECOVERY BUG
|
||||
We are updating data_file_length before writing any log record for the
|
||||
row operation. What if now state is flushed by a checkpoint with the
|
||||
new value, and crash before the checkpoint record is written, recovery
|
||||
may not even open the table (no log records) so not fix
|
||||
data_file_length ("WAL violation")?
|
||||
Scenario: assume share->id==0, then:
|
||||
thread 1 (here) thread 2 (checkpoint)
|
||||
update data_file_length
|
||||
copy state to memory, flush log
|
||||
set share->id and write FILE_ID (not flushed)
|
||||
see share->id!=0 so flush state
|
||||
crash
|
||||
FILE_ID will be missing, Recovery will not open table and not fix
|
||||
data_file_length. This bug should be fixed with other "checkpoint vs
|
||||
bitmap" bugs.
|
||||
One possibility will be logging a standalone LOGREC_CREATE_BITMAP in a
|
||||
separate transaction (using dummy_transaction_object).
|
||||
*/
|
||||
share->state.state.data_file_length= end_of_page;
|
||||
bzero(bitmap->map, bitmap->block_size);
|
||||
|
@ -598,6 +607,12 @@ static my_bool _ma_change_bitmap_page(MARIA_HA *info,
|
|||
|
||||
if (bitmap->changed)
|
||||
{
|
||||
/**
|
||||
@todo RECOVERY BUG this is going to flush the bitmap page possibly to
|
||||
disk even though it could be over-allocated with not yet any REDO-UNDO
|
||||
complete group (WAL violation: no way to undo the over-allocation if
|
||||
crash). See also collect_tables().
|
||||
*/
|
||||
if (write_changed_bitmap(info->s, bitmap))
|
||||
DBUG_RETURN(1);
|
||||
bitmap->changed= 0;
|
||||
|
|
|
@ -1444,17 +1444,9 @@ static my_bool write_tail(MARIA_HA *info,
|
|||
{
|
||||
/*
|
||||
We are modifying a state member before writing the UNDO; this is a WAL
|
||||
violation: assume this setting is made, checkpoint flushes new state,
|
||||
and crash happens before the UNDO is written: how to undo the bad state?
|
||||
Fortunately for data_file_length this is ok: as long as we change
|
||||
data_file_length after writing any REDO or UNDO we are safe:
|
||||
- checkpoint flushes state only if it's older than
|
||||
checkpoint_start_log_horizon, and flushes log up to that horizon first
|
||||
- so if checkpoint flushed state with new data_file_length, REDO is in
|
||||
log so LOGREC_FILE_ID too, recovery will meet and open the table thus
|
||||
fix data_file_length to be the file's physical size.
|
||||
Same property is currently true in all places of this file which change
|
||||
data_file_length.
|
||||
violation. But for data_file_length this is ok, as long as we change
|
||||
data_file_length after writing any log record (FILE_ID/REDO/UNDO) (see
|
||||
collect_tables()).
|
||||
*/
|
||||
info->state->data_file_length= position + block_size;
|
||||
}
|
||||
|
|
|
@ -176,27 +176,6 @@ static int really_execute_checkpoint(void)
|
|||
DBUG_PRINT("info",("checkpoint_start_log_horizon (%lu,0x%lx)",
|
||||
LSN_IN_PARTS(checkpoint_start_log_horizon)));
|
||||
lsn_store(checkpoint_start_log_horizon_char, checkpoint_start_log_horizon);
|
||||
/*
|
||||
We are going to flush the state of some tables (in collect_tables()) if
|
||||
it's older than checkpoint_start_log_horizon. Before, all records
|
||||
describing how to undo this flushed state must be in the log
|
||||
(WAL). Usually this means UNDOs. In the special case of data_file_length,
|
||||
recovery just needs to open the table, so any LOGREC_FILE_ID/REDO/UNDO
|
||||
allowing recovery to understand it must open a table, is enough.
|
||||
*/
|
||||
/**
|
||||
Apart from data|key_file_length which are easily recoverable from the OS,
|
||||
all other state members must be updated only when writing the UNDO;
|
||||
otherwise, if updated before, if their new value is flushed by a
|
||||
checkpoint and there is a crash before UNDO is written, their REDO group
|
||||
will be missing or at least incomplete and skipped by recovery, so bad
|
||||
state value will stay. For example, setting key_root before writing the
|
||||
UNDO: the table would have old index page (they were pinned at time of
|
||||
crash) and a new, thus wrong, key_root.
|
||||
@todo RECOVERY BUG check that all code honours that.
|
||||
*/
|
||||
if (translog_flush(checkpoint_start_log_horizon))
|
||||
goto err;
|
||||
|
||||
/*
|
||||
STEP 2: fetch information about transactions.
|
||||
|
@ -887,9 +866,32 @@ static int collect_tables(LEX_STRING *str, LSN checkpoint_start_log_horizon)
|
|||
*/
|
||||
}
|
||||
translog_unlock();
|
||||
/**
|
||||
We are going to flush these states.
|
||||
Before, all records describing how to undo such state must be
|
||||
in the log (WAL). Usually this means UNDOs. In the special case of
|
||||
data|key_file_length, recovery just needs to open the table to fix the
|
||||
length, so any LOGREC_FILE_ID/REDO/UNDO allowing recovery to
|
||||
understand it must open a table, is enough; so as long as
|
||||
data|key_file_length is updated after writing any log record it's ok:
|
||||
if we copied new value above, it means the record was before
|
||||
state_copies_horizon and we flush such record below.
|
||||
Apart from data|key_file_length which are easily recoverable from the
|
||||
real file's size, all other state members must be updated only when
|
||||
writing the UNDO; otherwise, if updated before, if their new value is
|
||||
flushed by a checkpoint and there is a crash before UNDO is written,
|
||||
their REDO group will be missing or at least incomplete and skipped
|
||||
by recovery, so bad state value will stay. For example, setting
|
||||
key_root before writing the UNDO: the table would have old index
|
||||
pages (they were pinned at time of crash) and a new, thus wrong,
|
||||
key_root.
|
||||
@todo RECOVERY BUG check that all code honours that.
|
||||
*/
|
||||
if (translog_flush(state_copies_horizon))
|
||||
goto err;
|
||||
/* now we have cached states and they are WAL-safe*/
|
||||
state_copies_end= state_copy;
|
||||
state_copy= state_copies;
|
||||
/* so now we have cached states */
|
||||
}
|
||||
|
||||
/* locate our state among these cached ones */
|
||||
|
@ -911,15 +913,11 @@ static int collect_tables(LEX_STRING *str, LSN checkpoint_start_log_horizon)
|
|||
dfile= share->bitmap.file;
|
||||
/*
|
||||
Ignore table which has no logged writes (all its future log records will
|
||||
be found naturally by Recovery). This also avoids flushing
|
||||
a data_file_length changed too early by a client (before any log record
|
||||
was written, giving no chance to recovery to meet and open the table,
|
||||
see _ma_read_bitmap_page()).
|
||||
Ignore obsolete shares (_before_ setting themselves to last_version=0
|
||||
they already did all flush and sync; if we flush their state now we may
|
||||
be flushing an obsolete state onto a newer one (assuming the table has
|
||||
been reopened with a different share but of course same physical index
|
||||
file).
|
||||
be found naturally by Recovery). Ignore obsolete shares (_before_
|
||||
setting themselves to last_version=0 they already did all flush and
|
||||
sync; if we flush their state now we may be flushing an obsolete state
|
||||
onto a newer one (assuming the table has been reopened with a different
|
||||
share but of course same physical index file).
|
||||
*/
|
||||
if ((share->id != 0) && (share->last_version != 0))
|
||||
{
|
||||
|
@ -974,7 +972,6 @@ static int collect_tables(LEX_STRING *str, LSN checkpoint_start_log_horizon)
|
|||
It may also be a share which got last_version==0 since we checked
|
||||
last_version; in this case, it flushed its state and the LSN test
|
||||
above will catch it.
|
||||
Last, see comments at start of really_execute_checkpoint().
|
||||
*/
|
||||
}
|
||||
else
|
||||
|
@ -1005,6 +1002,12 @@ static int collect_tables(LEX_STRING *str, LSN checkpoint_start_log_horizon)
|
|||
each checkpoint if the table was once written and then not anymore.
|
||||
*/
|
||||
}
|
||||
/**
|
||||
@todo RECOVERY BUG this is going to flush the bitmap page possibly to
|
||||
disk even though it could be over-allocated with not yet any
|
||||
REDO-UNDO complete group (WAL violation: no way to undo the
|
||||
over-allocation if crash); see also _ma_change_bitmap_page().
|
||||
*/
|
||||
sync_error|=
|
||||
_ma_flush_bitmap(share); /* after that, all is in page cache */
|
||||
DBUG_ASSERT(share->pagecache == maria_pagecache);
|
||||
|
|
|
@ -1156,13 +1156,14 @@ prototype_redo_exec_hook(REDO_INSERT_ROW_HEAD)
|
|||
/**
|
||||
@todo RECOVERY BUG
|
||||
we stamp page with UNDO's LSN. Assume an operation logs REDO-REDO-UNDO
|
||||
where the two REDOs are about the same page. Then recovery applies first
|
||||
REDO and skips second REDO which is wrong. Solutions:
|
||||
a) when applying REDO, keep page pinned, don't stamp it, remember it;
|
||||
when seeing UNDO, unpin pages and stamp them; for BLOB pages we cannot
|
||||
pin them (too large for memory) so need an additional pass in REDO phase:
|
||||
- find UNDO
|
||||
- execute all REDOs about this UNDO but skipping REDOs for
|
||||
where the two REDOs are about the same page (that is possible only with a
|
||||
head or tail page, not blob page). Then recovery applies first REDO and
|
||||
skips second REDO which is wrong. Solution:
|
||||
a)
|
||||
* when applying REDO to head or tail, keep page pinned, don't stamp it,
|
||||
* when applying REDO to blob page, stamp it with UNDO's LSN
|
||||
* when seeing UNDO, unpin head/tail pages and stamp them with UNDO's
|
||||
LSN.
|
||||
or b) when applying REDO, stamp page with REDO's LSN (=> difference in
|
||||
'cmp' between run-time and recovery, need a special 'cmp'...).
|
||||
*/
|
||||
|
|
Loading…
Add table
Reference in a new issue