We provide an interface allowing secure access to Berkeley DB. Our goal is to allow users to have encrypted, secure databases. In this document, the term ciphering means the act of encryption or decryption. The two are inverse operations, and the same issues apply to both, just in opposite directions.
Falling out from this work will be a simple mechanism to allow users to request that we checksum their data for additional error detection (without encryption/decryption).
We expect that data in process memory or stored in shared memory, potentially backed by disk, is not encrypted or secure.
A prior thought had been to allow different passwords on the environment and the databases within. However, such a scheme requires that the database password be logged in order for recovery to be able to restore the database. Therefore, any application having the password for the log could get the password for any database by reading the log. So having a different password on a database does not gain any additional security, and it makes certain things harder and more complex. For one, database and env passwords would need to be handled differently, since they'd need to be stored and accessed from different places. For another, the question of how db_checkpoint or db_sync, which flush database pages to disk, would find the passwords of various databases without any dbps was unsolved. The feature didn't gain anything and caused significant pain. Therefore the decision is that there will be a single password protecting an environment, all the logs, and any secure databases within that environment. We do allow users to have a secure environment and clear databases. Users that want secure databases within a secure environment must set a flag.
Users wishing to enable encryption on a database in a secure environment or enable just checksumming on their database pages will use new flags to DB->set_flags(). Providing ciphering over an entire environment is accomplished by adding a single environment method: DBENV->set_encrypt(). Providing encryption for a database (not part of an environment) is accomplished by adding a new database method: DB->set_encrypt().
Both of the set_encrypt methods must be called before their respective open calls. The environment method must be before the environment open because we must know about security before there is any possibility of writing any log records out. The database method must be before the database open in order to read the root page. The planned interfaces for these methods are:
    DBENV->set_encrypt(DBENV *dbenv,      /* DB_ENV structure */
                       char *passwd,      /* Password */
                       u_int32_t flags);  /* Flags */

    DB->set_encrypt(DB *dbp,              /* DB structure */
                    char *passwd,         /* Password */
                    u_int32_t flags);     /* Flags */

The flags accepted by these functions are:
    #define DB_ENCRYPT_AES 0x00000001 /* Use the AES encryption algorithm */

Passwords are NUL-terminated strings. NULL or zero-length strings are illegal. These flags enable checksumming and encryption using the particular algorithms we have chosen for this implementation. The flags are named so that there is a logical naming pattern if additional checksum or encryption algorithms are added. If a user gives a flag of zero, it will behave in a manner similar to DB_UNKNOWN. It will be illegal if they are creating the environment or database, as an algorithm must be specified. If they are joining an existing environment or opening an existing database, they will use whatever algorithm is in force at the time. Using DB_ENCRYPT_AES automatically implies SHA1 checksumming.
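The argument rules above can be sketched as a small check. This is only an illustration: check_encrypt_args and its creating parameter are hypothetical names, not part of the planned interface, and in practice the "creating versus joining" part of the test could only be made at open time, since set_encrypt is called before open.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the flag value given in this document. */
#define DB_ENCRYPT_AES 0x00000001

/*
 * check_encrypt_args is a hypothetical helper, not a Berkeley DB function.
 * It returns 0 if the arguments are acceptable and EINVAL otherwise.
 * `creating` is nonzero when the environment or database is being created.
 */
int
check_encrypt_args(const char *passwd, uint32_t flags, int creating)
{
	/* NULL or zero-length passwords are illegal. */
	if (passwd == NULL || passwd[0] == '\0')
		return (EINVAL);
	/* Reject any flag bits this sketch does not know about. */
	if ((flags & ~(uint32_t)DB_ENCRYPT_AES) != 0)
		return (EINVAL);
	/* A creator must name an algorithm; a joiner may pass zero. */
	if (flags == 0 && creating)
		return (EINVAL);
	return (0);
}
```

A joiner that passes zero would then pick up whatever algorithm is recorded in the existing environment or database.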
These functions will perform several initialization steps. We will allocate crypto_handle for our env handle and set up our function pointers. We will allocate space and copy the password into our env handle password area. Similar to DB->set_cachesize, calling DB->set_encrypt will actually reflect back into the local environment created by DB.
Lastly, we will add a new flag, DB_OVERWRITE, to the DBENV->remove method. The purpose of this flag is to force all of the memory used by the shared regions to be overwritten before removal. We will use rm_overwrite, a function that overwrites and syncs a file 3 times with varying bit patterns, to really remove a file. Additionally, this flag will force a sync of the overwritten regions to disk, if the regions are backed by the file system. That way there is no residual information left in the clear in memory or in freed disk blocks. Although we expect that this flag will be used primarily by customers using security, its action does not depend on passwords or a secure setup, and so it can be used by anyone.
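A minimal sketch of the overwrite-and-sync idea follows, using POSIX calls. The function names overwrite_pass and overwrite_file and the three bit patterns are assumptions made for illustration; the text only says "3 times with varying bit patterns" and does not describe how rm_overwrite is written.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* One pass: fill the first `size` bytes of fd with `pattern`, then sync. */
int
overwrite_pass(int fd, off_t size, unsigned char pattern)
{
	unsigned char buf[4096];
	off_t left;
	size_t want;
	ssize_t nw;

	memset(buf, pattern, sizeof(buf));
	if (lseek(fd, 0, SEEK_SET) == (off_t)-1)
		return (-1);
	for (left = size; left > 0; left -= nw) {
		want = left < (off_t)sizeof(buf) ? (size_t)left : sizeof(buf);
		if ((nw = write(fd, buf, want)) <= 0)
			return (-1);
	}
	/* Push this pass to disk before starting the next one. */
	return (fsync(fd));
}

/* Overwrite an existing file three times; the patterns are assumed. */
int
overwrite_file(const char *path)
{
	static const unsigned char patterns[3] = { 0xff, 0x00, 0xff };
	struct stat sb;
	int fd, i, ret;

	if ((fd = open(path, O_WRONLY)) == -1)
		return (-1);
	ret = fstat(fd, &sb);
	for (i = 0; ret == 0 && i < 3; i++)
		ret = overwrite_pass(fd, sb.st_size, patterns[i]);
	if (close(fd) != 0)
		ret = -1;
	return (ret);
}
```

The caller would unlink the file only after overwrite_file returns 0. The sync after every pass is what keeps the passes from being coalesced in the buffer cache.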
    void *crypto_handle; /* Security handle */

The crypto handle will really point to a new __db_cipher structure that will contain a set of functions and a pointer to the in-memory information needed by the specific encryption algorithm. It will look like:
    typedef struct __db_cipher {
        int (*init)__P((...));     /* Alg-specific initialization function */
        int (*encrypt)__P((...));  /* Alg-specific encryption algorithm */
        int (*decrypt)__P((...));  /* Alg-specific decryption function */
        void *data;                /* Pointer to alg-specific information (AES_CIPHER) */
        u_int32_t flags;           /* Cipher flags */
    } DB_CIPHER;
    #define DB_MAC_KEY 20 /* Size of the MAC key */

    typedef struct __aes_cipher {
        keyInstance encrypt_ki;        /* Encrypt keyInstance temp. */
        keyInstance decrypt_ki;        /* Decrypt keyInstance temp. */
        u_int8_t mac_key[DB_MAC_KEY];  /* MAC key */
        u_int32_t flags;               /* AES-specific flags */
    } AES_CIPHER;

It should be noted that none of these structures have their own mutex. We hold the environment region locked while we are creating this, but once this is set up, it is read-only forever.
During dbenv->set_encrypt, we set the encryption, decryption and checksumming methods to the appropriate functions based on the flags. This function will allocate us a crypto handle that we store in the DB_ENV structure just like all the other subsystems. For now, only AES ciphering functions and SHA1 checksumming functions are supported. Also we will copy the password into the DB_ENV structure. We ultimately need to keep the password in the environment's shared memory region or compare this one against the one that is there, if we are joining an existing environment, but we do not have it yet because open has not yet been called. We will allocate a structure that will be used in initialization and set up the function pointers to point to the algorithm-specific functions.
In the __env_open path, in __db_e_attach, if we are creating the region and the dbenv->passwd field is set, we need to use the length of the password in the initial computation of the environment's size. This guarantees sufficient space for storing the password in shared memory. Then we will call a new function, __crypto_region_init, from __env_open to initialize the security region. If we are the creator, we will allocate space in the shared region to store the password, copy the password into that space, and store its offset in the REGENV structure. If we are not the creator, we will compare the password stored in the dbenv with the one in shared memory, and we will also compare the ciphering algorithm to the one stored in the shared region. If either does not match, we return an error. In all cases we'll smash the dbenv password and free it. Then __crypto_region_init will call the initialization function set up earlier based on the ciphering algorithm specified. For now we will call __aes_init. Additionally this function will allocate and set up the per-process state vector for this encryption's IVs. See Generating the Initialization Vector for a detailed description of the IV and state vector.
The AES-specific initialization function, __aes_init, will call __aes_derivekeys in order to fill in the keyInstance and mac_key fields of the AES_CIPHER structure. The REGENV structure will have one additional item:
roff_t passwd_off; /* Offset of passwd */
Also, we will need to add a flag in the database meta-data page that indicates that the database is encrypted and what its algorithm is. This will be used when the meta-page is read after reopening a file. We need this information on the meta-page in order to detect a user opening a secure database without a password. I propose using the first unused1 byte (renaming it too) in the meta page for this purpose.
The first 64 bytes of every page will not be encrypted. On database meta-pages, ciphering will apply to the first 512 bytes only. All meta-page types will have an IV and checksum added within the first 512 bytes, as well as a crypto magic number. This will expand the size of the meta-page from 256 bytes to 512 bytes. The page in/out routines, __db_pgin and __db_pgout, know the type of the page and will apply the 512-byte ciphering to meta-pages. In __db_pgout, if we have a crypto handle in our (private) environment, we will apply ciphering to either the entire page, or the first 512 bytes if it is a meta-page. In __db_pgin, we will decrypt the page if we have a crypto handle.
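One reading of the page layout above is sketched here: the first 64 bytes of a page stay in the clear, and ciphering covers the rest of the page, or the rest of the first 512 bytes on a meta-page. The names PAGE_CLEAR_BYTES, META_CIPHER_BYTES, cipher_span and page_cipher_span are hypothetical, and treating the 64 clear bytes as part of the 512 on a meta-page is an assumption on our part.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_CLEAR_BYTES  64	/* Hypothetical name: clear page prefix */
#define META_CIPHER_BYTES 512	/* Hypothetical name: meta-page cipher limit */

/* Byte range of a page that is handed to the cipher. */
struct cipher_span {
	size_t off;	/* Offset of the first ciphered byte */
	size_t len;	/* Number of ciphered bytes */
};

/* Compute the ciphered range for a page of the given size and kind. */
struct cipher_span
page_cipher_span(int is_meta, size_t pagesize)
{
	struct cipher_span s;
	size_t end;

	/* The smallest page must still hold a full meta-page. */
	assert(pagesize >= META_CIPHER_BYTES);
	end = is_meta ? META_CIPHER_BYTES : pagesize;
	s.off = PAGE_CLEAR_BYTES;
	s.len = end - PAGE_CLEAR_BYTES;
	return (s);
}
```

For power-of-two page sizes, both spans come out as multiples of 16 bytes, which is what the AES routines described later require.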
When multiple processes share a database, all must use the same password as the database creator. Using an existing database requires several conditions to be true. First, if the creator of the database did not create with security, then opening later with security is an error. Second, if the creator did create it with security, then opening later without security is an error. Third, we need to be able to test and check that when another process opens a secure database that the password they provided is the same as the one in use by the creator.
When reading the meta-page, in __db_file_setup, we do not go through the paging functions, but directly read via __os_read. It is at this point that we will determine if the user is configured correctly. If the meta-page we read has an IV and checksum, they better have a crypto handle. If they have a crypto handle, then the meta-page must have an IV and checksum. If both of those are true, we test the password. We compare the unencrypted magic number to the newly-decrypted crypto magic number and if they are not the same, then we report that the user gave us a bad password.
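The decision made at this point can be written out as a small function. This is a sketch only: meta_status, check_meta_config and its parameters are hypothetical names, and the sketch assumes the caller has already decrypted the crypto magic number with the user's key.

```c
#include <assert.h>
#include <stdint.h>

enum meta_status {
	META_OK_CLEAR,		/* No crypto handle, unencrypted file */
	META_OK_SECURE,		/* Crypto handle, magic numbers agree */
	META_ERR_NO_HANDLE,	/* Encrypted file opened without a password */
	META_ERR_NOT_SECURE,	/* Password given for an unencrypted file */
	META_ERR_BAD_PASSWD	/* Decrypted magic number does not match */
};

/*
 * have_handle:    nonzero if the user configured a crypto handle.
 * page_is_secure: nonzero if the meta-page carries an IV and checksum.
 * magic:          the unencrypted magic number from the meta-page.
 * decrypted_crypto_magic: the crypto magic number after decryption.
 */
enum meta_status
check_meta_config(int have_handle, int page_is_secure,
    uint32_t magic, uint32_t decrypted_crypto_magic)
{
	if (page_is_secure && !have_handle)
		return (META_ERR_NO_HANDLE);
	if (have_handle && !page_is_secure)
		return (META_ERR_NOT_SECURE);
	if (!have_handle)
		return (META_OK_CLEAR);
	/* Both sides are secure, so the password itself can be tested. */
	return (magic == decrypted_crypto_magic ?
	    META_OK_SECURE : META_ERR_BAD_PASSWD);
}
```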
On a mostly unrelated topic, even when we go to very large pagesizes, the meta information will still be within a disk sector. So, after talking it over with Keith and Margo, we determined that unencrypted meta-pages still will not need a checksum.
    __aes_derivekeys(DB_ENV *dbenv,             /* dbenv */
                     u_int8_t *passwd,          /* Password */
                     size_t passwd_len,         /* Length of passwd */
                     u_int8_t *mac_key,         /* 20 byte array to store MAC key */
                     keyInstance *encrypt_key,  /* Encryption key of passwd */
                     keyInstance *decrypt_key); /* Decryption key of passwd */

This is the only function requiring the textual user password. From the password, this function generates a key used in the checksum function, __db_chksum. It also fills in keyInstance structures which are then used in the encryption and decryption routines. The keyInstance structures must already be allocated. These will be stored in the AES_CIPHER structure.
    __db_chksum(u_int8_t *data,      /* Data to checksum */
                size_t data_len,     /* Length of data */
                u_int8_t *mac_key,   /* 20 byte array from __aes_derivekeys */
                u_int8_t *checksum); /* 20 byte array to store checksum */

This function generates a checksum on the data given. This function will do double-duty for users that simply want error detection on their pages. When users are using encryption, the mac_key will contain the 20-byte key set up in __aes_derivekeys. If they just want checksumming, then mac_key will be NULL. According to Adam, we can safely use the first N bytes of the checksum. So for seeding the generator for initialization vectors, we'll hash the time and then send in the first 4 bytes for the seed. I believe we can probably do the same thing for checksumming log records. We can only use 4 bytes for the checksum in the non-secure case. So when we want to verify the log checksum we can compute the MAC but just compare the first 4 bytes to the one we read. All locations where we generate or check log record checksums that currently call __ham_func4 will now call __db_chksum. I believe there are 5 such locations: __log_put, __log_putr, __log_newfile, __log_rep_put and __txn_force_abort.
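Using only the first bytes of the 20-byte result might look like the sketch below. The helper names log_cksum_matches and seed_from_digest and the constant LOG_CKSUM_BYTES are hypothetical; the digest itself is assumed to have been produced already by __db_chksum.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DB_MAC_KEY      20	/* Size of the full checksum, as in the text */
#define LOG_CKSUM_BYTES 4	/* Hypothetical name: non-secure log checksum */

/*
 * Compare a freshly computed 20-byte checksum against the 4 bytes read
 * from a non-secure log header.  Returns nonzero on a match.
 */
int
log_cksum_matches(const uint8_t mac[DB_MAC_KEY],
    const uint8_t stored[LOG_CKSUM_BYTES])
{
	return (memcmp(mac, stored, LOG_CKSUM_BYTES) == 0);
}

/* Take the first 4 bytes of a hashed time value as the generator seed. */
uint32_t
seed_from_digest(const uint8_t digest[DB_MAC_KEY])
{
	uint32_t seed;

	/* memcpy avoids alignment problems with the byte array. */
	memcpy(&seed, digest, sizeof(seed));
	return (seed);
}
```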
    __aes_encrypt(DB_ENV *dbenv,     /* dbenv */
                  keyInstance *key,  /* Password key instance from __aes_derivekeys */
                  u_int8_t *iv,      /* Initialization vector */
                  u_int8_t *data,    /* Data to encrypt */
                  size_t data_len);  /* Length of data to encrypt - 16 byte multiple */

This is the function to encrypt data. It will be called to encrypt pages and log records. The key instance is initialized in __aes_derivekeys. The initialization vector, iv, is the 16-byte random value set up by the Mersenne Twister pseudo-random generator. Lastly, we pass in a pointer to the data to encrypt and its length in data_len. The data_len must be a multiple of 16 bytes. The encryption is done in place, so that when the encryption code returns, our encrypted data is in the same location as the original data.
    __aes_decrypt(DB_ENV *dbenv,     /* dbenv */
                  keyInstance *key,  /* Password key instance from __aes_derivekeys */
                  u_int8_t *iv,      /* Initialization vector */
                  u_int8_t *data,    /* Data to decrypt */
                  size_t data_len);  /* Length of data to decrypt - 16 byte multiple */

This is the function to decrypt the data. It is exactly the same as the encryption function except for the action it performs. All of the args and issues are the same. It also decrypts in place.
We will not be holding any locks when we need to generate our IV but we need to protect access to the state vector and the index. Calls to the MT code will come while encrypting some data in __aes_encrypt. The MT code will assume that all necessary locks are held in the caller. We will have per-process state vectors that are set up when a process begins. That way we minimize the contention and only multi-threaded processes need acquire locks for the IV. We will have the state vector in the environment handle in heap memory, as well as the index and there will be a mutex protecting it for threaded access. This will be added to the DB_ENV structure:
    DB_MUTEX *mt_mutexp; /* Mersenne Twister mutex */
    int *mti;            /* MT index */
    u_long *mt;          /* MT state vector */

This portion of the environment will be initialized at the end of __dbenv_open, right after we initialize the other mutex for the dblist. When we allocate the space, we will generate our initial state vector. If we are multi-threaded we'll allocate and initialize our mutex also.
We need to make changes to the MT code to make it work in our namespace and to take a pointer to the location of the state vector and the index. There will be a wrapper function __db_generate_iv that DB will call and it will call the appropriate MT function. I am also going to change the default seed to use a hashed time instead of a hard coded value. I have looked at other implementations of the MT code available on the web site. The C++ version does a hash on the current time. I will modify our MT code to seed with the hashed time as well. That way the code to seed is contained within the MT code and we can just write the wrapper to get an IV. We will not be changing the core computational code of MT.
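The shape of such a wrapper is sketched below. To keep the example short, a xorshift32 generator stands in for the Mersenne Twister; that substitution, and the names iv_gen, iv_gen_next, generate_iv and IV_BYTES, are ours and not part of the design. As described above, the generator assumes the caller already holds whatever mutex protects the state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IV_BYTES 16	/* Hypothetical name: size of the IV in bytes */

/* Stand-in generator state; the design uses the MT state vector and index. */
struct iv_gen {
	uint32_t state;	/* Must be seeded with a nonzero value */
};

/* One step of xorshift32, a stand-in for the MT output function. */
uint32_t
iv_gen_next(struct iv_gen *g)
{
	uint32_t x = g->state;

	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	return (g->state = x);
}

/*
 * Fill iv with IV_BYTES of generator output, one 32-bit word at a time.
 * The caller is assumed to hold the mutex protecting *g if threaded.
 */
void
generate_iv(struct iv_gen *g, uint8_t iv[IV_BYTES])
{
	uint32_t word;
	size_t i;

	assert(g != NULL && g->state != 0);
	for (i = 0; i < IV_BYTES; i += sizeof(word)) {
		word = iv_gen_next(g);
		memcpy(iv + i, &word, sizeof(word));
	}
}
```

In the design, the seed would come from the first 4 bytes of a hashed time value rather than from a constant.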
For ciphering log records, the encryption will be done as the first thing (or a new wrapper) in __log_put. See Log Record Encryption for those details.
Since several paging macros use inp[X] in them, those macros must now take a dbp. There are a lot of changes to make all the necessary paging macros take a dbp, although these changes are trivial in nature.
Also, there is a new function __db_chk_meta to perform checksumming and decryption checking on meta pages specifically. This function is where we check that the database algorithm matches what the user gave (or if they set DB_CIPHER_ANY then we set it), and other encryption related testing for bad combinations of what is in the file versus what is in the user structures.
If we get a checksum error, then we need to log a message stating a checksum error occurred on page N. In __db_pgin, we can check if logging is on in the environment. If so, we want to log the message.
When the application gets the DB_RUNRECOVERY error, they'll have to shut down their application and run recovery. When the recovery encounters the record indicating checksum failure, then normal recovery will fail and the user will have to perform catastrophic recovery. When catastrophic recovery encounters that record, it will simply ignore it.
On reading the log, via log cursors, the log code stores log records in the log buffer. Records in that buffer will be encrypted, so decryption will occur no matter whether we are returning records from the buffer or if we are returning log records directly from the disk. Current checksum checking is done in __log_get_c_int. Decryption will be done after the checksum is checked.
There are currently two nasty issues with encrypted log records. The first is that __txn_force_abort overwrites a commit record in the log buffer with an abort record. Well, our log buffer will be encrypted. Therefore, __txn_force_abort is going to need to do encryption of its new record. This can be accomplished by sending in the dbenv handle to the function. It is available to us in __log_flush_commit and we can just pass it in. I don't like putting log encryption in the txn code, but the layering violation is already there.
The second issue is that the encryption code requires data that is a multiple of 16 bytes and log record lengths are variable. We will need to pad log records to meet the requirement. Since the callers of __log_put set up the given DBT it is a logical place to pad if necessary. We will modify the gen_rec.awk script to have all of the generated logging functions pad for us if we have a crypto handle. This padding will also expand the size of log files. Anyone calling log_put and using security from the application will have to pad on their own or it will return an error.
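The padding computation itself is small. In this sketch, padded_len and CIPHER_BLOCK are hypothetical names; the generated logging functions would use the result to size the DBT they hand to __log_put.

```c
#include <assert.h>
#include <stddef.h>

#define CIPHER_BLOCK 16	/* Hypothetical name: required length multiple */

/* Round len up to the next multiple of CIPHER_BLOCK (a power of two). */
size_t
padded_len(size_t len)
{
	return ((len + CIPHER_BLOCK - 1) & ~(size_t)(CIPHER_BLOCK - 1));
}
```

A 45-byte record would grow to 48 bytes, while a record that is already a multiple of 16 bytes is left alone.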
When ciphering the log file, we will need a different header than the current one. The current header only has space for a 4 byte checksum. Our secure header will need space for the 16 byte IV and 20 byte checksum. This will blow up our log files when running securely since every single log record header will now consume 32 additional bytes. I believe that the log header does not need to be encrypted. It contains an offset, a length and our IV and checksum. Our IV and checksum are never encrypted. I don't believe there to be any risk in having the offset and length in the clear.
I would prefer not to have two types of log headers that are incompatible with each other. However, it is not acceptable to increase the log headers of all users from 12 bytes to 44 bytes. Such a change would also make log files incompatible with earlier releases. Even worse, the cksum field of the header sits between the offset and len. It would be really convenient if we could have just made a bigger cksum portion without affecting the location of the other fields. Oh well. Most customers will not be using encryption and we won't make them pay the price of the expanded header. Keith indicates that the log file format is changing with the next release, so I will move the cksum field so it can at least be overlaid.
One method around this would be to have a single internal header that contains all the information both mechanisms need, but when we write out the header we choose which pieces to write. By appending the security information to the end of the existing structure, and adding a size field, we can modify a few places to use the size field to write out only the current first 12 bytes, or the entire security header needed.
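A sketch of that single internal header follows. It assumes the cksum field has already been moved after len so the larger checksum can overlay it; the struct and function names, and the exact field order, are hypothetical and chosen only so the two on-disk sizes come out to 12 and 44 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DB_MAC_KEY    20	/* Size of the full checksum */
#define IV_BYTES      16	/* Hypothetical name: size of the IV */
#define HDR_NORMAL_SZ 12	/* Hypothetical name: prev + len + 4-byte cksum */
#define HDR_CRYPTO_SZ 44	/* Hypothetical name: prev + len + cksum + IV */

/* Internal header; only the first `size` bytes are ever written out. */
struct log_hdr {
	uint32_t prev;			/* Offset of the previous record */
	uint32_t len;			/* Length of this record */
	uint8_t chksum[DB_MAC_KEY];	/* Only first 4 bytes used when clear */
	uint8_t iv[IV_BYTES];		/* Security information, appended */
	size_t size;			/* In-memory only: bytes to write */
};

/* Copy the on-disk form of the header to out; returns bytes written. */
size_t
hdr_write(const struct log_hdr *h, uint8_t *out)
{
	/* The layout must pack the written fields into exactly 44 bytes. */
	assert(offsetof(struct log_hdr, iv) + IV_BYTES == HDR_CRYPTO_SZ);
	assert(h->size == HDR_NORMAL_SZ || h->size == HDR_CRYPTO_SZ);
	memcpy(out, h, h->size);
	return (h->size);
}
```

Readers of the log would set size the same way before reading, so the rest of the log code could work with one structure.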
On the master side we must copy the DBT sent in. We encrypt the original and send the clear copy to the clients. On the client side, support for encryption is added into __log_rep_put.
Joining an existing environment requires several conditions to be true. First, if the creator of the environment did not create with security, then joining later with security is an error. Second, if the creator did create it with security, then joining later without security is an error. Third, we need to be able to test and check that when another process joins a secure environment that the password they provided is the same as the one in use by the creator.
The first two scenarios should be fairly trivial to determine: if we aren't creating the environment, we can compare what is there with what we have. In the third case, the __crypto_region_init function will see that the environment region has a valid passwd_off and we'll then compare that password to the one we have in our dbenv handle. In any case we'll smash the dbenv handle's passwd and free that memory before returning, whether we have a password match or not.
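The compare-then-smash step might be sketched as follows. The names smash and check_and_smash are hypothetical, and the sketch assumes the handle's password was allocated with malloc. The overwrite goes through a volatile pointer so that a compiler is less likely to drop the stores as dead code just before the free.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Zero a buffer through a volatile pointer so the stores are kept. */
void
smash(char *p, size_t len)
{
	volatile char *vp = p;

	while (len-- > 0)
		*vp++ = 0;
}

/*
 * Compare the handle's password with the one stored in the region, then
 * smash and free the handle's copy whether or not they matched.
 * Returns 0 on a match and -1 otherwise; *passwdp is set to NULL.
 */
int
check_and_smash(char **passwdp, const char *region_passwd)
{
	size_t len;
	int ret;

	assert(passwdp != NULL && *passwdp != NULL && region_passwd != NULL);
	len = strlen(*passwdp);
	ret = strcmp(*passwdp, region_passwd) == 0 ? 0 : -1;
	smash(*passwdp, len);
	free(*passwdp);
	*passwdp = NULL;
	return (ret);
}
```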
We need to store the passwords themselves in the region because multiple calls to the __aes_derivekeys function with the same password yield different keyInstance contents. Therefore we don't have any way to check passwords other than retaining and comparing the actual passwords.
All of the pages are stored in memory, and possibly in the file system, in the clear. The password is stored in the shared data regions in the clear. Therefore if an attacker can read the process memory, they can do whatever they want. If the attacker can read system memory or swap, they can access the data as well. Since everything in the shared data regions (with the exception of the buffered log) will be in the clear, it is important to realize that file-backed regions will be written in the clear, including the portion of the regions containing passwords. We recommend that users use system memory instead of file-backed shared memory.