#if !defined(OMT_H)
#define OMT_H

#ident "Copyright (c) 2008 Tokutek Inc. All rights reserved."

// Order Maintenance Tree (OMT)
//
// Maintains a collection of totally ordered values, where each value has an integer weight.
// The OMT is a mutable datatype.
//
// The Abstraction:
//
// An OMT is a vector of values, $V$, where $|V|$ is the length of the vector.
// The vector is numbered from $0$ to $|V|-1$.
// Each value has a weight.  The weight of the $i$th element is denoted $w(V_i)$.
//
// We can create a new OMT, which is the empty vector.
//
// We can insert a new element $x$ into slot $i$, changing $V$ into $V'$ where
//  $|V'|=1+|V|$       and
//
//   V'_j = V_j       if $j<i$
//          x         if $j=i$
//          V_{j-1}   if $j>i$.
//
// We can specify $i$ using a kind of function instead of as an integer.
// Let $b$ be a function mapping from values to nonzero integers, such that
// the signum of $b$ is monotonically increasing.
// We can specify $i$ as the minimum integer such that $b(V_i)>0$.
//
// We look up a value using its index, or using a Heaviside function.
// For lookups, we allow $b$ to be zero for some values, and again the signum of $b$ must be monotonically increasing.
// When looking up values, we can look up
//  $V_i$ where $i$ is the minimum integer such that $b(V_i)=0$.   (With a special return code if no such value exists.)
//      (Rationale:  Ordinarily we want $i$ to be unique.  But for various reasons we want to allow multiple zeros, and we want the smallest $i$ in that case.)
//  $V_i$ where $i$ is the minimum integer such that $b(V_i)>0$.   (Or an indication that no such value exists.)
//  $V_i$ where $i$ is the maximum integer such that $b(V_i)<0$.   (Or an indication that no such value exists.)
//
// When looking up a value using a Heaviside function, we get the value and its index.
//
// We can also split an OMT into two OMTs, splitting the weight of the values evenly.
// Find a value $j$ such that the values to the left of $j$ have about the same total weight as the values to the right of $j$.
// The resulting two OMTs contain the values to the left of $j$ and the values to the right of $j$ respectively.
// All of the values from the original OMT go into one of the new OMTs.
// If the weights of the values don't split exactly evenly, then the implementation has the freedom to choose whether
//  the new left OMT or the new right OMT is larger.
//
// Performance:
//  Insertion and deletion should run with $O(\log |V|)$ time and $O(\log |V|)$ calls to the Heaviside function.
//  The memory required is $O(|V|)$.
//
// OMTs also support cursors.  An OMTCURSOR is a mutable object that, at any
// moment in time, is either associated with a single OMT or is not
// associated with any OMT.  Many different OMTCURSORs can be associated
// with a single OMT.
//
// We say that an OMTCURSOR is *invalid* if it is not currently
// associated with an OMT.
//
// Abstractly, an OMTCURSOR simply contains an integer offset of a
// particular OMTVALUE.  We call this abstract integer the *offset*.
// Note, however, that the implementation may use a more
// complex representation in order to obtain higher performance.
// (Note: A first implementation might use the integer.)
//
// Given a valid OMTCURSOR, one can
//  * obtain the OMTVALUE at which the integer points in O(1) time,
//  * increment or decrement the abstract integer (usually quickly).
//    The requirement is that if the cursor is initialized to a
//    randomly chosen valid integer, then the integer can be
//    incremented in O(1) expected time.
//
// The OMTCURSOR may become invalidated under several conditions:
//  * Incrementing or decrementing the abstract integer out of its
//    valid range invalidates the OMTCURSOR.
//  * If the OMT is modified, it may invalidate the cursor.
//  * The user of the OMTCURSOR may explicitly invalidate the cursor.
//  * The OMT is destroyed (in which case the OMTCURSOR is
//    invalidated, but not destroyed).
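//
// Illustration (not part of the OMT API): the abstract cursor described above
// can be sketched with a plain integer offset over a sorted array that stands
// in for the OMT.  Every toy_* name below is hypothetical.  This sketch covers
// only out-of-range and explicit invalidation; it makes no attempt to model
// invalidation caused by modifying or destroying a real OMT.

```c
#include <stddef.h>

/* Hypothetical toy cursor; none of these names belong to the OMT API. */
struct toy_cursor {
    const int *values;  /* stand-in for the OMT: a sorted array */
    size_t     n;       /* number of values in the array */
    size_t     offset;  /* the abstract integer offset */
    int        valid;   /* nonzero iff associated and in range */
};

/* Explicitly invalidate the cursor (it is not destroyed). */
static void toy_cursor_invalidate(struct toy_cursor *c) {
    c->values = NULL;
    c->n = 0;
    c->offset = 0;
    c->valid = 0;
}

/* Associate the cursor with an array and set its offset.
 * An out-of-range offset leaves the cursor invalid. */
static void toy_cursor_set(struct toy_cursor *c, const int *values,
                           size_t n, size_t offset) {
    c->values = values;
    c->n = n;
    c->offset = offset;
    c->valid = (values != NULL && offset < n);
}

/* Store the current value in *v.  Returns 0, or -1 if the cursor is invalid. */
static int toy_cursor_current(const struct toy_cursor *c, int *v) {
    if (!c->valid) return -1;
    *v = c->values[c->offset];
    return 0;
}

/* Increment the offset.  Stepping out of range invalidates the cursor
 * and leaves *v unchanged. */
static int toy_cursor_next(struct toy_cursor *c, int *v) {
    if (!c->valid || c->offset + 1 >= c->n) {
        toy_cursor_invalidate(c);
        return -1;
    }
    c->offset++;
    *v = c->values[c->offset];
    return 0;
}

/* Decrement the offset.  Stepping out of range invalidates the cursor
 * and leaves *v unchanged. */
static int toy_cursor_prev(struct toy_cursor *c, int *v) {
    if (!c->valid || c->offset == 0) {
        toy_cursor_invalidate(c);
        return -1;
    }
    c->offset--;
    *v = c->values[c->offset];
    return 0;
}
```

// In this toy, every operation is O(1) because the container is an array.
// With a tree-shaped OMT, fetching through an integer offset would cost
// O(\log n) instead.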

// Implementation Hints
//
// One way to implement the OMTCURSOR is with an integer.  The problem
// is that obtaining the value at which the integer
// points takes O(\log n) time, which is not fast enough to meet the
// specification.  However, this implementation is probably much
// faster than our current implementation because it is O(\log n)
// integer comparisons instead of O(\log n) key comparisons.  This
// simple implementation may be the right thing for a first cut.
//
// To actually achieve the performance requirements, here's a better
// implementation: The OMTCURSOR contains a path from root to leaf.
// Fetching the current value is O(1) time since the leaf is
// immediately accessible.  Modifying the path to find the next or
// previous item has O(1) expected time at a randomly chosen valid
// point.
//
// The path can be implemented as an array.  It probably makes sense
// for the array to be dynamically resized as needed.  Since the
// array's size is O(\log n), it is not necessary to ever shrink the
// array.  Also, from the perspective of testing, it's probably best
// if the array is initialized to a short length (e.g., length 4) so
// that the doubling code is actually exercised.
//
// One way to implement invalidation is for each OMT to maintain a
// doubly linked list of OMTCURSORs.  When destroying an OMT or
// changing the OMT's shape, one can simply step through the list
// invalidating all the OMTCURSORs.
//
// The list of OMTCURSORs should use the list.h abstraction.  If it's
// not clear how to use it, Rich can explain it.

// Usage Hint: The OMTCURSOR is designed to be used inside the
// BRTcursor.  A BRTcursor includes a pointer to an OMTCURSOR, which
// is created when the BRTcursor is created.
//
// The BRT cursor implements its search by first finding a leaf node,
// containing an OMT.  The BRT then passes its OMTCURSOR into the lookup
// method (i.e., one of toku_omt_fetch, toku_omt_find_zero,
// toku_omt_find).
// The lookup method, if successful, sets the
// OMTCURSOR to refer to that element.
//
// As long as the OMTCURSOR remains valid, a BRTCURSOR next or prev
// operation can be implemented using next or prev on the OMTCURSOR.
//
// If the OMTCURSOR becomes invalidated, then the BRT must search
// again from the root of the tree.  The only error that an OMTCURSOR
// next operation can raise is that it is invalid.
//
// If an element is inserted into the BRT, it may cause an OMTCURSOR
// to become invalid.  This is especially true if the element will end
// up in the OMT associated with the cursor.  A simple implementation
// is to invalidate all OMTCURSORs any time anything is inserted
// into the BRT.  Since the BRT already contains a list of BRT cursors
// associated with it, it is straightforward to go through that list
// and invalidate all the cursors.
//
// When the BRT closes a cursor, it destroys the OMTCURSOR.

// The programming API:

//typedef struct value *OMTVALUE; // A slight improvement over using void*.
typedef struct omt *OMT;

typedef struct omt_cursor *OMTCURSOR;

int toku_omt_create (OMT *omtp);
// Effect: Create an empty OMT.  Stores it in *omtp.
// Requires: omtp != NULL
// Returns:
//   0        success
//   ENOMEM   out of memory (and doesn't modify *omtp)
// Performance: constant time.

int toku_omt_create_from_sorted_array(OMT *omtp, OMTVALUE *values, u_int32_t numvalues);
// Effect: Create an OMT containing values.  The number of values is in numvalues.
//  Stores the new OMT in *omtp.
// Requires: omtp != NULL
// Requires: values != NULL
// Requires: values is sorted
// Returns:
//   0        success
//   ENOMEM   out of memory (and doesn't modify *omtp)
// Performance:  time=O(numvalues)
// Rationale:    Normally to insert N values takes O(N lg N) amortized time.
//               If the N values are known in advance, are sorted, and
//               the structure is empty, we can batch insert them much faster.

void toku_omt_destroy(OMT *omtp);
// Effect:  Destroy an OMT, freeing all its memory.
//   Does not free the OMTVALUEs stored in the OMT.
//   Those values may be freed before or after calling toku_omt_destroy.
//   Also sets *omtp=NULL.
// Requires: omtp != NULL
// Requires: *omtp != NULL
// Rationale:  The usage is to do something like
//   toku_omt_destroy(&s->omt);
// and now s->omt will have a NULL pointer instead of a dangling freed pointer.
// Rationale: Returns no values since free() cannot fail.
// Rationale: Does not free the OMTVALUEs to reduce complexity.
// Performance:  time=O(toku_omt_size(*omtp))

u_int32_t toku_omt_size(OMT V);
// Effect: return |V|.
// Requires: V != NULL
// Performance:  time=O(1)

int toku_omt_iterate_on_range(OMT omt, u_int32_t left, u_int32_t right, int (*f)(OMTVALUE, u_int32_t, void*), void*v);
// Effect:  Iterate over the values of the omt, from left to right, calling f on each value.
//  The second argument passed to f is the index of the value.
//  The third argument passed to f is v.
//  The indices run from 0 (inclusive) to toku_omt_size(omt) (exclusive).
//  We will iterate only over [left,right)
//
// Requires: omt != NULL
// Requires: left <= right
// Requires: f != NULL
// Returns:
//  If f ever returns nonzero, then the iteration stops, and the value returned by f is returned by toku_omt_iterate_on_range.
//  If f always returns zero, then toku_omt_iterate_on_range returns 0.
// Requires:  Don't modify omt while running.  (E.g., f may not insert or delete values from omt.)
// Performance: time=O(i+\log N) where i is the number of times f is called, and N is the number of elements in omt.
// Rationale: Although the functional iterator requires defining another function (as opposed to a C++ style iterator), it is much easier to read.

int toku_omt_iterate(OMT omt, int (*f)(OMTVALUE, u_int32_t, void*), void*v);
// Effect:  Iterate over the values of the omt, from left to right, calling f on each value.
//  The second argument passed to f is the index of the value.
//  The third argument passed to f is v.
//  The indices run from 0 (inclusive) to toku_omt_size(omt) (exclusive).
// Requires: omt != NULL
// Requires: f != NULL
// Returns:
//  If f ever returns nonzero, then the iteration stops, and the value returned by f is returned by toku_omt_iterate.
//  If f always returns zero, then toku_omt_iterate returns 0.
// Requires:  Don't modify omt while running.  (E.g., f may not insert or delete values from omt.)
// Performance: time=O(i+\log N) where i is the number of times f is called, and N is the number of elements in omt.
// Rationale: Although the functional iterator requires defining another function (as opposed to a C++ style iterator), it is much easier to read.

int toku_omt_insert_at(OMT omt, OMTVALUE value, u_int32_t idx);
// Effect: Increases indexes of all items at slot >= idx by 1.
//         Inserts value into the position at idx.
//
// Returns:
//   0         success
//   EINVAL    if idx>toku_omt_size(omt)
//   ENOMEM
// On error, omt is unchanged.
// Performance: time=O(\log N) amortized time.
// Rationale: Some future implementation may be O(\log N) worst-case time, but O(\log N) amortized is good enough for now.

int toku_omt_set_at (OMT omt, OMTVALUE value, u_int32_t idx);
// Effect:  Replaces the item at idx with value.
// Returns:
//   0       success
//   EINVAL  if idx>=toku_omt_size(omt)
// On error, omt is unchanged.
// Performance: time=O(\log N)
// Rationale: The BRT needs to be able to replace a value with another copy of the same value (allocated in a different location).

int toku_omt_insert(OMT omt, OMTVALUE value, int(*h)(OMTVALUE, void*v), void *v, u_int32_t *idx);
// Effect:  Insert value into the OMT.
//   If there is some i such that $h(V_i, v)=0$ then returns DB_KEYEXIST.
//   Otherwise, let i be the minimum value such that $h(V_i, v)>0$.
//      If no such i exists, then let i be |V|
//   Then this has the same effect as
//    toku_omt_insert_at(omt, value, i);
//   If idx!=NULL then i is stored in *idx
// Requires:  The signum of h must be monotonically increasing.
// Returns:
//    0            success
//    DB_KEYEXIST  the key is present (h was equal to zero for some value)
//    ENOMEM
// On nonzero return, omt is unchanged.
// On nonzero non-DB_KEYEXIST return, *idx is unchanged.
// Performance: time=O(\log N) amortized.
// Rationale: Some future implementation may be O(\log N) worst-case time, but O(\log N) amortized is good enough for now.

int toku_omt_delete_at(OMT omt, u_int32_t idx);
// Effect: Delete the item in slot idx.
//         Decreases indexes of all items at slot >= idx by 1.
// Returns
//     0            success
//     EINVAL       if idx>=toku_omt_size(omt)
// On error, omt is unchanged.
// Rationale: To delete an item, first find its index using toku_omt_find, then delete it.
// Performance: time=O(\log N) amortized.

void toku_omt_cursor_set_index(OMTCURSOR c, u_int32_t idx);
// Effect:
//  Set the abstract index.
// Requires:
//  The cursor is not invalid.

int toku_omt_fetch (OMT V, u_int32_t i, OMTVALUE *v, OMTCURSOR c);
// Effect: Set *v=V_i
//  If c!=NULL then set c's abstract offset to i.
// Requires: v   != NULL
// Returns
//    0             success
//    EINVAL        if i>=toku_omt_size(V)
// On nonzero return, *v is unchanged, and c (if nonnull) is either
//   invalidated or unchanged.
// Performance: time=O(\log N)
// Implementation Notes: It is possible that c was previously valid and was
//   associated with a different OMT.  If c is changed by this
//   function, the function must remove c's association with the old
//   OMT, and associate it with the new OMT.

int toku_omt_find_zero(OMT V, int (*h)(OMTVALUE, void*extra), void*extra, OMTVALUE *value, u_int32_t *idx, OMTCURSOR c);
// Effect:  Find the smallest i such that h(V_i, extra)>=0
//  If there is such an i and h(V_i,extra)==0 then set *idx=i and return 0.
//  If there is such an i and h(V_i,extra)>0  then set *idx=i and return DB_NOTFOUND.
//  If there is no such i then set *idx=toku_omt_size(V) and return DB_NOTFOUND.
// Requires: idx!=NULL

int toku_omt_find(OMT V, int (*h)(OMTVALUE, void*extra), void*extra, int direction, OMTVALUE *value, u_int32_t *idx, OMTCURSOR c);
// Effect:
//    If direction >0 then find the smallest i such that h(V_i,extra)>0.
//    If direction <0 then find the largest  i such that h(V_i,extra)<0.
//    (Direction may not be equal to zero.)
//    If value!=NULL then store V_i in *value
//    If idx!=NULL then store i in *idx.
// Requires: The signum of h is monotonically increasing.
// Returns
//    0             success
//    DB_NOTFOUND   no such value is found.
// On nonzero return, *value and *idx are unchanged, and c (if nonnull) is either invalidated
//   or unchanged.
// Performance: time=O(\log N)
// Rationale:
//   Here's how to use the find function to find various things
//   Cases for find:
//    find first value:         ( h(v)=+1, direction=+1 )
//    find last value           ( h(v)=-1, direction=-1 )
//    find first X              ( h(v)=(v< x) ? -1 : 1    direction=+1 )
//    find last X               ( h(v)=(v<=x) ? -1 : 1    direction=-1 )
//    find X or successor to X  ( same as find first X. )
//
// Rationale: To help understand heaviside functions and the behavior of find:
//  There are 7 kinds of heaviside functions.
//  The signum of h must be monotonically increasing.
//  Given a function of the following form, A is the element
//  returned for direction<0, B is the element returned
//  for direction>0, C is the element returned for
//  direction==0 (see find_zero) (with a return of 0), and D is the element
//  returned for direction==0 (see find_zero) with a return of DB_NOTFOUND.
//  If any of A, B, or C are not found, then asking for the
//  associated direction will return DB_NOTFOUND.
//  See find_zero for more information.
//
//  Let the following represent the signum of the heaviside function.
//
//    -...-
//        A
//         D
//
//    +...+
//    B
//    D
//
//    0...0
//    C
//
//    -...-0...0
//        AC
//
//    0...0+...+
//    C    B
//
//    -...-+...+
//        AB
//         D
//
//    -...-0...0+...+
//        AC    B

int toku_omt_split_at(OMT omt, OMT *newomt, u_int32_t idx);
// Effect: Create a new OMT, storing it in *newomt.
//  The values to the right of idx (starting at idx) are moved to *newomt.
// Requires: omt != NULL
// Requires: newomt != NULL
// Returns
//    0             success,
//    EINVAL        if idx > toku_omt_size(omt)
//    ENOMEM
// On nonzero return, omt and *newomt are unmodified.
// Performance: time=O(n)
// Rationale:  We don't need a split-evenly operation.  We need to split items so that their total sizes
//  are even, and other similar splitting criteria.  It's easy to split evenly by calling toku_omt_size(), and dividing by two.

int toku_omt_merge(OMT leftomt, OMT rightomt, OMT *newomt);
// Effect: Appends leftomt and rightomt to produce a new omt.
//  Sets *newomt to the new omt.
//  On success, leftomt and rightomt are destroyed.
// Returns 0 on success
//   ENOMEM on out of memory.
// On error, nothing is modified.
// Performance: time=O(n) is acceptable, but one can imagine implementations that are O(\log n) worst-case.

void toku_omt_clear(OMT omt);
// Effect: Set the tree to be empty.
//  Note: Will not reallocate or resize any memory, since returning void precludes calling malloc.
// Performance: time=O(1)

unsigned long toku_omt_memory_size (OMT omt);
// Effect: Return the size (in bytes) of the omt, as it resides in main memory.  Don't include any of the OMTVALUEs.

int toku_omt_cursor_create (OMTCURSOR *p);
// Effect: Create an OMTCURSOR.  Stores it in *p.  The OMTCURSOR is
// initially invalid.
// Requires: p != NULL
// Returns:
//   0        success
//   ENOMEM   out of memory (and doesn't modify *p)
// Performance: constant time.

void toku_omt_cursor_destroy (OMTCURSOR *p);
// Effect:  Invalidates *p (if it is valid) and frees any memory
// associated with *p.
// Also sets *p=NULL.
// Rationale:  The usage is to do something like
//   toku_omt_cursor_destroy(&c);
// and now c will have a NULL pointer instead of a dangling freed pointer.
// Rationale: Returns no values since free() cannot fail.

int toku_omt_cursor_is_valid (OMTCURSOR c);
// Effect:  returns 0 iff c is invalid.
// Performance:  time=O(1)

int toku_omt_cursor_next (OMTCURSOR c, OMTVALUE *v);
// Effect: Increment c's offset, and find and store the value in v.
// Requires: v != NULL
// Returns
//   0 success
//   EINVAL if the offset goes out of range or c is invalid.
// On nonzero return, *v is unchanged and c is invalidated.
// Performance: time=O(\log N) worst case, expected time=O(1) for a randomly
//  chosen initial position.

int toku_omt_cursor_current_index(OMTCURSOR c, u_int32_t *idx);
// Effect: Stores c's offset in *idx.
// Requires: idx != NULL
// Returns
//   0 success
//   EINVAL c is invalid
// On nonzero return, *idx is unchanged and c is unchanged.
// Performance: time=O(1)

OMT toku_omt_cursor_get_omt(OMTCURSOR c);
// Effect: returns the associated omt or NULL if not associated.
// Performance: time=O(1)

int toku_omt_cursor_current (OMTCURSOR c, OMTVALUE *v);
// Effect: Store in v the value pointed to by c's abstract offset
// Requires: v != NULL
// Returns
//    0 success
//    EINVAL if c is invalid
// On nonzero return, *v is unchanged
// Performance: O(1) time

int toku_omt_cursor_prev (OMTCURSOR c, OMTVALUE *v);
// Effect: Decrement c's offset, and find and store the value in v.
// Requires: v != NULL
// Returns
//   0 success
//   EINVAL if the offset goes out of range or c is invalid.
// On nonzero return, *v is unchanged and c is invalidated.
// Performance: time=O(\log N) worst case, expected time=O(1) for a randomly
//  chosen initial position.

void toku_omt_cursor_invalidate (OMTCURSOR c);
// Effect: Invalidate c.  (This does not mean that c is destroyed or
// that its memory is freed.)
// If c is valid, the invalidate callback function (if any) will be called
// before invalidating c.

void toku_omt_cursor_set_invalidate_callback(OMTCURSOR c, void (*f)(OMTCURSOR,void*), void* extra);
// Effect:
//  Saves function 'f' to be called whenever the cursor is invalidated.
//  'extra' is passed as an additional parameter to f.
// Requires:
//  The lifetime of the 'extra' parameter must continue at least until the cursor
//  is destroyed.

void toku_omt_cursor_associate(OMT omt, OMTCURSOR c);
// Effect:
//  Associates an omtcursor with an omt.
// Requires:
//  The omtcursor is not associated with any other omt.
// Requires:
//  toku_omt_cursor_associate must be called when the omt-lock is held
// Rationale:
//  This is used by brt_cursors for omts representing leaf nodes.
//  These omts are touched by multiple threads, and therefore require locks
//  for modifying the list of omtcursors.
//  We do not want to grab two locks (one for omt, and one for the old
//  associated omt).

#endif  /* #ifndef OMT_H */