mirror of
https://github.com/MariaDB/server.git
synced 2025-01-25 00:04:33 +01:00
452 lines
17 KiB
Text
452 lines
17 KiB
Text
# $Id: Design.fileop,v 11.4 2000/02/19 20:57:54 bostic Exp $
|
|
|
|
The design of file operation recovery.
|
|
|
|
Keith has asked me to write up notes on our current status of database
|
|
create and delete and recovery, why it's so hard, and how we've violated
|
|
all the cornerstone assumptions on which our recovery framework is based.
|
|
|
|
I am including two documents at the end of this one. The first is the
|
|
initial design of the recoverability of file create and delete (there is
|
|
no talk of subdatabases there, because we didn't think we'd have to do
|
|
anything special there). I will annotate this document on where things
|
|
changed.
|
|
|
|
The second is the design of recd007 which is supposed to test our ability
|
|
to recover these operations regardless of where one crashes. This test
|
|
is fundamentally different from our other recovery tests in the following
|
|
manner. Normally, the application controls transaction boundaries.
|
|
Therefore, we can perform an operation and then decide whether to commit
|
|
or abort it. In the normal recovery tests, we force the database into
|
|
each of the four possible states from a recovery perspective:
|
|
|
|
database is pre-op, undo (do nothing)
|
|
database is pre-op, redo
|
|
database is post-op, undo
|
|
database is post-op, redo (do nothing)
|
|
|
|
By copying databases at various points and initiating txn_commit and abort
|
|
appropriately, we can make all these things happen. Notice that the one
|
|
case we don't handle is where page A is in one state (e.g., pre-op) and
|
|
page B is in another state (e.g., post-op). I will argue that these don't
|
|
matter because each page is recovered independently. If anyone can poke
|
|
holes in this, I'm interested.
|
|
|
|
The problem with create/delete recovery testing is that the transaction
|
|
is begun and ended all inside the library. Therefore, there is never any
|
|
point (outside the library) where we can copy files and or initiate
|
|
abort/commit. In order to still put the recovery code through its paces,
|
|
Sue designed an infrastructure that lets you tell the library where to
|
|
make copies of things and where to suddenly inject errors so that the
|
|
transaction gets aborted. This level of detail allows us to push the
|
|
create/delete recovery code through just about every recovery path
|
|
possible (although I'm sure Mike will tell me I'm wrong when he starts to
|
|
run code coverage tools).
|
|
|
|
OK, so that's all preamble and a brief discussion of the documents I'm
|
|
enclosing.
|
|
|
|
Why was this so hard and painful and why is the code so Q@#$!% complicated?
|
|
The following is a discussion/explanation, but to the best of my knowledge,
|
|
the structure we have in place now works. The key question we need to be
|
|
asking is, "Does this need to have to be so complex or should we redesign
|
|
portions to simplify it?" At this point, there is no obvious way to simplify
|
|
it in my book, but I may be having difficulty seeing this because my mind is
|
|
too polluted at this point.
|
|
|
|
Our overall strategy for recovery is that we do write-ahead logging,
|
|
that is we log an operation and make sure it is on disk before any
|
|
data corresponding to the data that log record describes is on disk.
|
|
Typically we use log sequence numbers (LSNs) to mark the data so that
|
|
during recovery, we can look at the data and determine if it is in a
|
|
state before a particular log record or after a particular log record.
|
|
|
|
In the good old days, opens were not transaction protected, so we could
|
|
do regular old opens during recovery and if the file existed, we opened
|
|
it and if it didn't (or appeared corrupt), we didn't and treated it like
|
|
a missing file. As will be discussed below in detail, our states are much
|
|
more complicated and recovery can't make such simplistic assumptions.
|
|
|
|
Also, since we are now dealing with file system operations, we have less
|
|
control about when they actually happen and what the state of the system
|
|
can be. That is, we have to write create log records synchronously, because
|
|
the create/open system call may force a newly created (0-length) file to
|
|
disk. This file has to now be identified as being in the "being-created"
|
|
state.
|
|
|
|
A. We used to make a number of assumptions during recovery:
|
|
|
|
1. We could call db_open at any time and one of three things would happen:
|
|
a) the file would be opened cleanly
|
|
b) the file would not exist
|
|
c) we would encounter an error while opening the file
|
|
|
|
Case a posed no difficulty.
|
|
In Case b, we simply spit out a warning that a file was missing and then
|
|
ignored all subsequent operations to that file.
|
|
In Case c, we reported a fatal error.
|
|
|
|
2. We can always generate a warning if a file is missing.
|
|
|
|
3. We never encounter NULL file names in the log.
|
|
|
|
B. We also made some assumptions in the main-line library:
|
|
|
|
1. If you try to open a file and it exists but is 0-length, then
|
|
someone else is trying to open it.
|
|
|
|
2. You can write pages anywhere in a file and any non-existent pages
|
|
are 0-filled. [This breaks on Windows.]
|
|
|
|
3. If you have proper permissions then you can always evict pages from
|
|
the buffer pool.
|
|
|
|
4. During open, we can close the master database handle as soon as
|
|
we're done with it since all the rest of the activity will take place
|
|
on the subdatabase handle.
|
|
|
|
In our brave new world, most of these assumptions are no longer valid.
|
|
Let's address them one at a time.
|
|
|
|
A.1 We could call db_open at any time and one of three things would happen:
|
|
a) the file would be opened cleanly
|
|
b) the file would not exist
|
|
c) we would encounter an error while opening the file
|
|
There are now additional states. Since we are trying to make file
|
|
operations recoverable, you can now die in the middle of such an
|
|
operation and we have to be able to pick up the pieces. What this
|
|
now means is that:
|
|
|
|
* a 0-length file can be an indication of a create in-progress
|
|
* you can have a meta-data page but no root page (of a btree)
|
|
* if a file doesn't exist, it could mean that it was just about
|
|
to be created and needs to be rolled forward.
|
|
* if you encounter an error in a file (e.g., the meta-data page
|
|
is all 0's) you could still be in mid-open.
|
|
|
|
I have now made this all work, but it required significant changes to the
|
|
db_open code and error handling and this is the sort of change that makes
|
|
everyone nervous.
|
|
|
|
A.2. We can always generate a warning if a file is missing.
|
|
|
|
Now that we have a delete file method in the API, we need to make sure
|
|
that we do not generate warning messages for files that don't exist if
|
|
we see that they were explicitly deleted.
|
|
|
|
This means that we need to save state during recovery, determine which
|
|
files were missing and were not being recreated and were not deleted and
|
|
only complain about those.
|
|
|
|
A.3. We never encounter NULL file names in the log.
|
|
|
|
Now that we allow tranaction protection on memory-resident files, we write
|
|
log messages for files with NULL file names. This means that our assumption
|
|
of always being able to call "db_open" on any log_register OPEN message found
|
|
in the log is no longer valid.
|
|
|
|
B.1. If you try to open a file and it exists but is 0-length, then
|
|
someone else is trying to open it.
|
|
|
|
As discussed for A.1, this is no longer true. It may be instead that you
|
|
are in the process of recovering a create.
|
|
|
|
B.2. You can write pages anywhere in a file and any non-existent pages
|
|
are 0-filled.
|
|
|
|
It turns out that this is not true on Windows. This means that places
|
|
we do group allocation (hash) must explicitly allocate each page, because
|
|
we can't count on recognizing the uninitialized pages later.
|
|
|
|
B.3. If you have proper permissions then you can always evict pages from
|
|
the buffer pool.
|
|
|
|
In the brave new world though, files can be deleted and they may
|
|
have pages in the mpool. If you later try to evict these, you
|
|
discover that the file doesn't exist. We'd get here when we had
|
|
to dirty pages during a remove operation.
|
|
|
|
B.4. You can close files any time you want.
|
|
|
|
However, if the file takes part in the open/remove transaction,
|
|
then we had better not close it until after the transaction
|
|
commits/aborts, because we need to be able to get our hands on the
|
|
dbp and the open happened in a different transaction.
|
|
|
|
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
|
|
Design for recovering file create and delete in the presence of subdatabases.
|
|
|
|
Assumptions:
|
|
Remove the O_TRUNCATE flag.
|
|
Single-thread all open/create/delete operations.
|
|
(Well, almost all; we'll optimize opens without DB_CREATE set.)
|
|
The reasoning for this is that with two simultaneous
|
|
open/creaters, during recovery, we cannot identify which
|
|
transaction successfully created files and therefore cannot
|
|
recovery correctly.
|
|
File system creates/deletes are synchronous
|
|
Once the file is open, subdatabase creates look like regular
|
|
get/put operations and a metadata page creation.
|
|
|
|
There are 4 cases to deal with:
|
|
1. Open/create file
|
|
2. Open/create subdatabase
|
|
3. Delete
|
|
4. Recovery records
|
|
|
|
__db_fileopen_recover
|
|
__db_metapage_recover
|
|
__db_delete_recover
|
|
existing c_put and c_get routines for subdatabase creation
|
|
|
|
Note that the open/create of the file and the open/create of the
|
|
subdatabase need to be in the same transaction.
|
|
|
|
1. Open/create (full file and subdb version)
|
|
|
|
If create
|
|
LOCK_FILEOP
|
|
txn_begin
|
|
log create message (open message below)
|
|
do file system open/create
|
|
if we did not create
|
|
abort transaction (before going to open_only)
|
|
if (!subdb)
|
|
set dbp->open_txn = NULL
|
|
else
|
|
txn_begin a new transaction for the subdb open
|
|
|
|
construct meta-data page
|
|
log meta-data page (see metapage)
|
|
write the meta-data page
|
|
* It may be the case that btrees need to log both meta-data pages
|
|
and root pages. If that is the case, I believe that we can use
|
|
this same record and recovery routines for both
|
|
|
|
txn_commit
|
|
UNLOCK_FILEOP
|
|
|
|
2. Delete
|
|
LOCK_FILEOP
|
|
txn_begin
|
|
log delete message (delete message below)
|
|
mv file __db.file.lsn
|
|
txn_commit
|
|
unlink __db.file.lsn
|
|
UNLOCK_FILEOP
|
|
|
|
3. Recovery Routines
|
|
|
|
__db_fileopen_recover
|
|
if (argp->name.size == 0
|
|
done;
|
|
|
|
if (redo) /* Commit */
|
|
__os_open(argp->name, DB_OSO_CREATE, argp->mode, &fh)
|
|
__os_closehandle(fh)
|
|
if (undo) /* Abort */
|
|
if (argp->name exists)
|
|
unlink(argp->name);
|
|
|
|
__db_metapage_recover
|
|
if (redo)
|
|
__os_open(argp->name, 0, 0, &fh)
|
|
__os_lseek(meta data page)
|
|
__os_write(meta data page)
|
|
__os_closehandle(fh);
|
|
if (undo)
|
|
done = 0;
|
|
if (argp->name exists)
|
|
if (length of argp->name != 0)
|
|
__os_open(argp->name, 0, 0, &fh)
|
|
__os_lseek(meta data page)
|
|
__os_read(meta data page)
|
|
if (read succeeds && page lsn != current_lsn)
|
|
done = 1
|
|
__os_closehandle(fh);
|
|
if (!done)
|
|
unlink(argp->name)
|
|
|
|
__db_delete_recover
|
|
if (redo)
|
|
Check if the backup file still exists and if so, delete it.
|
|
|
|
if (undo)
|
|
if (__db_appname(__db.file.lsn exists))
|
|
mv __db_appname(__db.file.lsn) __db_appname(file)
|
|
|
|
__db_metasub_recover
|
|
/* This is like a normal recovery routine */
|
|
Get the metadata page
|
|
if (cmp_n && redo)
|
|
copy the log page onto the page
|
|
update the lsn
|
|
make sure page gets put dirty
|
|
else if (cmp_p && undo)
|
|
update the lsn to the lsn in the log record
|
|
make sure page gets put dirty
|
|
|
|
if the page was modified, put it back dirty
|
|
|
|
In db.src
|
|
|
|
# name: filename (before call to __db_appname)
|
|
# mode: file system mode
|
|
BEGIN open
|
|
DBT name DBT s
|
|
ARG mode u_int32_t o
|
|
END
|
|
|
|
# opcode: indicate if it is a create/delete and if it is a subdatabase
|
|
# pgsize: page size on which we're going to write the meta-data page
|
|
# pgno: page number on which to write this meta-data page
|
|
# page: the actual meta-data page
|
|
# lsn: LSN of the meta-data page -- 0 for new databases, may be non-0
|
|
# for subdatabases.
|
|
|
|
BEGIN metapage
|
|
ARG opcode u_int32_t x
|
|
DBT name DBT s
|
|
ARG pgno db_pgno_t d
|
|
DBT page DBT s
|
|
POINTER lsn DB_LSN * lu
|
|
END
|
|
|
|
# We do not need a subdatabase name here because removing a subdatabase
|
|
# name is simply a regular bt_delete operation from the master database.
|
|
# It will get logged normally.
|
|
# name: filename
|
|
BEGIN delete
|
|
DBT name DBT s
|
|
END
|
|
|
|
# We also need to reclaim pages, but we can use the existing
|
|
# bt_pg_alloc routines.
|
|
|
|
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
|
|
Testing recoverability of create/delete.
|
|
|
|
These tests are unlike other tests in that they are going to
|
|
require hooks in the library. The reason is that the create
|
|
and delete calls are internally wrapped in a transaction, so
|
|
that if the call returns, the transaction has already either
|
|
commited or aborted. Using only that interface limits what
|
|
kind of testing we can do. To match our other recovery testing
|
|
efforts, we need to add hooks to trigger aborts at particular
|
|
times in the create/delete path.
|
|
|
|
The general recovery testing strategy is that we wish to
|
|
execute every path through every recovery routine. That
|
|
means that we try to:
|
|
catch each operation in its pre-operation state
|
|
call the recovery function with redo
|
|
call the recovery function with undo
|
|
catch each operation in its post-operation state
|
|
call the recovery function with redo
|
|
call the recovery function with undo
|
|
|
|
In addition, there are a few critical points in the create and
|
|
delete path that we want to make sure we capture.
|
|
|
|
1. Test Structure
|
|
|
|
The test structure should be similar to the existing recovery
|
|
tests. We will want to have a structure in place where we
|
|
can execute different commands:
|
|
create a file/database
|
|
create a file that will contain subdatabases.
|
|
create a subdatabase
|
|
remove a subdatabase (that contains valid data)
|
|
remove a subdatabase (that does not contain any data)
|
|
remove a file that used to contain subdatabases
|
|
remove a file that contains a database
|
|
|
|
The tricky part is capturing the state of the world at the
|
|
various points in the create/delete process.
|
|
|
|
The critical points in the create process are:
|
|
|
|
1. After we've logged the create, but before we've done anything.
|
|
in db/db.c
|
|
after the open_retry
|
|
after the __crdel_fileopen_log call (and before we've
|
|
called __os_open).
|
|
|
|
2. Immediately after the __os_open
|
|
|
|
3. Immediately after each __db_log_page call
|
|
in bt_open.c
|
|
log meta-data page
|
|
log root page
|
|
in hash.c
|
|
log meta-data page
|
|
|
|
4. With respect to the log records above, shortly after each
|
|
log write is an memp_fput. We need to do a sync after
|
|
each memp_fput and trigger a point after that sync.
|
|
|
|
The critical points in the remove process are:
|
|
|
|
1. Right after the crdel_delete_log in db/db.c
|
|
|
|
2. Right after the __os_rename call (below the crdel_delete_log)
|
|
|
|
3. After the __db_remove_callback call.
|
|
|
|
I believe that there are the places where we'll need some sort of hook.
|
|
|
|
2. Adding hooks to the library.
|
|
|
|
The hooks need two components. One component is to capture the state of
|
|
the database at the hook point and the other is to trigger a txn_abort at
|
|
the hook point. The second part is fairly trivial.
|
|
|
|
The first part requires more thought. Let me explain what we do in a
|
|
"normal" recovery test. In a normal recovery test, we save an intial
|
|
copy of the database (this copy is called init). Then we execute one
|
|
or more operations. Then, right before the commit/abort, we sync the
|
|
file, and save another copy (the afterop copy). Finally, we call txn_commit
|
|
or txn_abort, sync the file again, and save the database one last time (the
|
|
final copy).
|
|
|
|
Then we run recovery. The first time, this should be a no-op, because
|
|
we've either committed the transaction and are checking to redo it or
|
|
we aborted the transaction, undid it on the abort and are checking to
|
|
undo it again.
|
|
|
|
We then run recovery again on whatever database will force us through
|
|
the path that requires work. In the commit case, this means we start
|
|
with the init copy of the database and run recovery. This pushes us
|
|
through all the redo paths. In the abort case, we start with the afterop
|
|
copy which pushes us through all the undo cases.
|
|
|
|
In some sense, we're asking the create/delete test to be more exhaustive
|
|
by defining all the trigger points, but I think that's the correct thing
|
|
to do, since the create/delete is not initiated by a user transaction.
|
|
|
|
So, what do we have to do at the hook points?
|
|
1. sync the file to disk.
|
|
2. save the file itself
|
|
3. save any files named __db_backup_name(name, &backup_name, lsn)
|
|
Since we may not know the right lsns, I think we should save
|
|
every file of the form __db.name.0xNNNNNNNN.0xNNNNNNNN into
|
|
some temporary files from which we can restore it to run
|
|
recovery.
|
|
|
|
3. Putting it all together
|
|
|
|
So, the three pieces are writing the test structure, putting in the hooks
|
|
and then writing the recovery portions so that we restore the right thing
|
|
that the hooks saved in order to initiate recovery.
|
|
|
|
Some of the technical issues that need to be solved are:
|
|
How does the hook code become active (i.e., we don't
|
|
want it in there normally, but it's got to be
|
|
there when you configure for testing)?
|
|
How do you (the test) tell the library that you want a
|
|
particular hook to abort?
|
|
How do you (the test) tell the library that you want the
|
|
hook code doing its copies (do we really want
|
|
*every* test doing these copies during testing?
|
|
Maybe it's not a big deal, but maybe it is; we
|
|
should at least think about it).
|