The essential idea of auto-upgrade from BRT_LAYOUT_VERSION 12 to 13 is to take advantage of the similarities between the two versions, and not to try to create an infrastructure for all future upgrades. As future layouts are created, upgrade paths, if any, will be crafted to each particular change.

On startup, the version number of the recovery log is checked. If an upgrade is needed, then the log is tested for a clean shutdown. If there is no clean shutdown, then an error is returned. If the log does end in a clean shutdown, then a new log file is created with the current version number, starting with an LSN that is one greater than that of the clean shutdown.

Once the new log is in place, the persistent environment dictionary is upgraded, and then normal operation begins.

The startup of a new version of the storage engine might not be crash safe.

Dictionaries, including the persistent environment and the fileops directory, are upgraded as they are read into memory from disk.

The brt header is upgraded by:
 - removing an unused flag
 - setting the transaction id to the xid of the clean shutdown
 - marking the header as dirty

Each non-leaf node is upgraded by:
 - removing an unused flag
 - upgrading the version numbers in the node
 - marking the node as dirty

This works because all of the version 12 messages are unchanged in version 13. The version 12 messages will be applied to the leafentries using version 13 code.

Each leaf node is upgraded by:
 - removing an unused flag
 - using modified version 12 code to unpack the version 12 packed leaf entries into version 13 unpacked leaf entries
 - repacking the leafentries into a new mempool
 - destroying the original mempool (that holds the version 12 node read from disk)

The node is then marked as dirty.

Once the brt is open, a BRT_OPTIMIZE broadcast message is inserted to optimize the dictionary.
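
The startup check described above can be sketched in C as follows. This is only an illustration, not the engine's code: struct log_summary, plan_log_upgrade() and the enum names are hypothetical, and the sketch assumes the log version is numbered in step with the layout version (12 and 13).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the log version tracks the layout version (12 -> 13). */
enum { LOG_VERSION_OLD = 12, LOG_VERSION_CURRENT = 13 };

/* Hypothetical summary of what startup learns from the existing log. */
struct log_summary {
    int      version;                 /* version recorded in the log     */
    bool     ends_in_clean_shutdown;  /* last entry is a clean shutdown  */
    uint64_t clean_shutdown_lsn;      /* valid only if the flag is true  */
};

enum upgrade_action {
    UPGRADE_NOT_NEEDED,     /* log is already at the current version     */
    UPGRADE_REFUSED,        /* return an error, change nothing on disk   */
    UPGRADE_START_NEW_LOG   /* create a new log file at *first_lsn       */
};

/* Decide what startup should do. On UPGRADE_START_NEW_LOG, *first_lsn is
 * set to one greater than the LSN of the clean shutdown. */
enum upgrade_action
plan_log_upgrade(const struct log_summary *log, uint64_t *first_lsn)
{
    assert(log != NULL && first_lsn != NULL);
    if (log->version == LOG_VERSION_CURRENT)
        return UPGRADE_NOT_NEEDED;
    if (log->version != LOG_VERSION_OLD)
        return UPGRADE_REFUSED;       /* only the 12 -> 13 path is handled */
    if (!log->ends_in_clean_shutdown)
        return UPGRADE_REFUSED;       /* recovery of old logs is unsupported */
    *first_lsn = log->clean_shutdown_lsn + 1;
    return UPGRADE_START_NEW_LOG;
}
```

A caller would create the new log file and upgrade the persistent environment dictionary only when UPGRADE_START_NEW_LOG is returned.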

A schematic overview of how a brt node is deserialized:

toku_deserialize_brtnode_from() {                       // accepts fd, fills in BRTNODE, brt_header
    deserialize_brtnode_from_rbuf_versioned() {
        deserialize_brtnode_from_rbuf() {               // accepts rbuf, fills in BRTNODE
            if nonleaf
                deserialize_brtnode_nonleaf_from_rbuf() // rbuf -> BRTNODE (no version sensitivity)
            if leaf
                deserialize_brtnode_leaf_from_rbuf()    // calculates node size from leafentry sizes
                                                        // leafentry sizes vary with version
        }
        if version 12 {
            if leaf {
                unpack each leafentry into a version 13 ule
                pack each version 13 ule into version 13 le
                allocate new mempool for version 13 les
                destroy old mempool
            }
            remove unused flag
            increment version number
            mark dirty
        }
    }
}

Open issues:
 - The brt layer makes some callbacks to the handlerton layer. If any of the functions change from one version to another, then the result may not be correct. A version number could be included in all the function signatures so the callback function could be aware of what version the caller is expecting. The callbacks are:
   - comparator
   - hot index generator
   - hot column mutator

Note, brt-internal.h defines struct subtree_estimates, which contains the field nkeys. This field is obsolete with the removal of dupsort databases (since it will always be the same as ndata), but removing it is not worth the trouble.
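
The idea of passing a version number to the callbacks (first open issue above) could look like the sketch below. None of these names exist in the brt or handlerton layers: struct key_sketch, versioned_compare_fn and example_compare() are hypothetical, and the comparison itself is a plain memcmp ordering chosen only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical key type: a pointer and a length. */
struct key_sketch {
    const void *data;
    size_t      size;
};

/* Hypothetical callback signature: the caller states which layout version
 * it expects. Returns 0 and stores the ordering in *result, or returns
 * EINVAL if the callback does not know that version. */
typedef int (*versioned_compare_fn)(int caller_version,
                                    const struct key_sketch *a,
                                    const struct key_sketch *b,
                                    int *result);

/* Byte-wise ordering, shorter key first on a common prefix. */
static int memcmp_keys(const struct key_sketch *a, const struct key_sketch *b)
{
    size_t n = a->size < b->size ? a->size : b->size;
    int c = n ? memcmp(a->data, b->data, n) : 0;
    if (c != 0)
        return c;
    return (a->size > b->size) - (a->size < b->size);
}

/* Example callback that orders keys the same way for versions 12 and 13
 * and rejects anything else instead of silently guessing. */
int example_compare(int caller_version,
                    const struct key_sketch *a,
                    const struct key_sketch *b,
                    int *result)
{
    assert(a != NULL && b != NULL && result != NULL);
    switch (caller_version) {
    case 12:
    case 13:
        *result = memcmp_keys(a, b);
        return 0;
    default:
        return EINVAL;
    }
}
```

Returning the ordering through an out-parameter keeps the return value free to report an unsupported version.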

==========

The changes from version 12 to 13 include (may not be complete list):
 - Persistent environment dictionary
   - version number
   - timestamp of environment creation (database installation)
   - history of previous versions
   - timestamps for upgrades
 - Recovery log
   - version number
   - new log entries (hotindex, maybe others)
 - brt header
   - version number
   - added field (root_xid_that_created), set to last checkpoint lsn
   - deleted flag (built-in comparison function for values)
 - brt internal node
   - version number
   - additional message(s) possible, no upgrade needed beyond changing version number
 - brt leafnode
   - version number
   - new leafentry format
   - version 12 leafentry unpack code is preserved
 - rollback log
   - version number is only change, no upgrade is needed because rollback logs are not preserved through clean shutdown

Because version 12 and version 13 leafentries are significantly different, leafentries are handled as follows:
 - deserialize_brtnode_leaf_from_rbuf()
   - sets up array of pointers to leafentries (to be unpacked later), these pointers are put into an OMT
   - calculates checksum (x1764)
   - adjusts ndone byte counter to verify that entire rbuf is read
 - deserialize_brtnode_from_rbuf_versioned() calls deserialize_brtnode_leaf_from_rbuf()
   - loop through all leafentries, one at a time:
     - unpack version 12 le and repack as version 13 le, each in its own malloc'ed memory
     - calculate new fingerprint
   - create new block
     - allocate new mempool
     - copy individual les into new mempool
     - destroy individual les
     - destroy original mempool

Open issues:
 - We need to verify clean shutdown before upgrade. If shutdown was not clean then we would run recovery, and the code does not support recovering from an old format version.
   - One way to do this is to increase the log version number (either increment or synchronize with BRT_LAYOUT_VERSION).
   - Can we just look at the log?
       needs_recovery(env);
     If this mechanism is specific to the version 12 to 13 upgrade, then that is adequate. Once the recovery log format changes, then we need a different mechanism, similar to the 3.x->4.x upgrade logic in log_upgrade.c.
 - How to decide that an upgrade is necessary? Needed for logic that says:
   - If upgrade is necessary, then verify clean shutdown: If upgrade is necessary (recorded version is old) and clean shutdown was not done, then exit with error code.
 - tokudb_needs_recovery() is not separate from verification of clean shutdown. This function indicates if a recovery is necessary, but it does not verify simple clean shutdown with just the shutdown log entry. Instead, it looks for checkpoint begin/checkpoint end. (Also, comment at end is permitted.)

Proposed solution:
 - Decision on whether to perform upgrade is done by examining log version.
 - If we need an upgrade:
   - If not clean shutdown, then exit with error message, change nothing on disk.
   - If clean shutdown, then create new log by simply creating new log file (empty, or perhaps with initial comment that says "start of new log").
   - Normal log-trimming code will delete old logs. (None of the locking logic in log_upgrade.c is needed.)
 - Log-opening logic needs to be modified to do this. See log file manager initialization function (and maybe functions it calls), maybe the log cursor:
   - logfilemgr.c: toku_logfilemgr_init()
 - Log-trimming logic loops over pairs of file names and LSNs, deleting old files based on LSN.
 - Question: would it help any if the "clean shutdown" log entry was required to be in a new log file of its own? It would prevent the creation of an empty log file after "clean shutdown." It might, but it's probably not worth doing.
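
The log-trimming loop over (file name, LSN) pairs mentioned above can be pictured with the sketch below. It is not the logfilemgr.c code: struct log_file_info, trim_logs() and dry_run_remove() are hypothetical, and the sketch assumes each pair records the largest LSN stored in that file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical (file name, LSN) pair kept by the log file manager. */
struct log_file_info {
    const char *name;
    uint64_t    last_lsn;   /* assumption: largest LSN stored in the file */
};

/* Remove every file whose entries all precede oldest_needed_lsn.
 * remove_file has the same shape as remove() from <stdio.h> and returns 0
 * on success. Surviving entries are compacted to the front of the array;
 * the return value is the number of survivors. */
size_t trim_logs(struct log_file_info *files, size_t n,
                 uint64_t oldest_needed_lsn,
                 int (*remove_file)(const char *))
{
    assert(files != NULL || n == 0);
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (files[i].last_lsn < oldest_needed_lsn &&
            remove_file(files[i].name) == 0)
            continue;                 /* file deleted, drop its entry */
        files[kept++] = files[i];     /* still needed, or delete failed */
    }
    return kept;
}

/* Dry-run stand-in for remove(): counts requests, touches nothing. */
size_t dry_run_removed = 0;
int dry_run_remove(const char *name)
{
    (void)name;
    dry_run_removed++;
    return 0;
}
```

With this shape, the old-version log files need no special handling: once the oldest LSN still needed lies in the new log, every old file qualifies for deletion. Passing remove() itself in place of dry_run_remove() would delete the files for real.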

Issue of optimize message (to be sent into each dictionary on upgrade):
 - BRT_COMMIT_BROADCAST_ALL (should be faster executing, always commits everything, was needed for an earlier upgrade attempt)
 - BRT_OPTIMIZE (better tested, has been used, tests to see if transactions are still live)

After upgrade (after clean shutdown, no running transactions, trees fully flattened), there is no difference in what these two messages do.

Note, BRT_OPTIMIZE requires a clean shutdown if used on upgrade. If used before recovery (which an upgrade without clean shutdown would do), then it would be wrong because it would appear that all transactions were completed.

TODO:
 - update brt header fields
   - original layout version
   - version read from disk
 - add accountability counters
 - capture LSN of clean shutdown, use instead of checkpoint lsn
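
The header-related TODO items could be sketched as below. This is an illustration only: struct header_sketch, UNUSED_FLAG_SKETCH and upgrade_header_12_to_13() are hypothetical, the flag bit is a stand-in, and the sketch assumes a version 12 header carries no record of an older original version.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in bit for the unused flag; the real bit value is not given here. */
#define UNUSED_FLAG_SKETCH ((uint32_t)1 << 0)

/* Hypothetical in-memory header with the proposed bookkeeping fields. */
struct header_sketch {
    int      layout_version;                /* version of in-memory form  */
    int      layout_version_original;       /* proposed: version at creation */
    int      layout_version_read_from_disk; /* proposed: version on disk  */
    uint32_t flags;
    uint64_t root_xid_that_created;
    bool     dirty;
};

/* Upgrade a version 12 header in memory, using the LSN of the clean
 * shutdown (rather than the checkpoint lsn) for the new xid field. */
void upgrade_header_12_to_13(struct header_sketch *h,
                             uint64_t clean_shutdown_lsn)
{
    assert(h != NULL && h->layout_version == 12);
    h->layout_version_read_from_disk = 12;
    h->layout_version_original = 12;     /* assumption, see lead-in */
    h->flags &= ~UNUSED_FLAG_SKETCH;     /* remove the unused flag   */
    h->root_xid_that_created = clean_shutdown_lsn;
    h->layout_version = 13;
    h->dirty = true;                     /* written out in the new format */
}
```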