mariadb/newbrt/omt.h
Yoni Fogel 681c5dd336 Addresses #1866 List of live transactions (in logger) is now an omt instead of a linked list.
refs [t:1866]

git-svn-id: file:///svn/toku/tokudb@13604 c7de825b-a66e-492c-adef-691d508d4ae1
2013-04-16 23:57:56 -04:00

508 lines
22 KiB
C

#if !defined(OMT_H)
#define OMT_H
#ident "$Id$"
#ident "Copyright (c) 2007, 2008, 2009 Tokutek Inc. All rights reserved."
#ident "The technology is licensed by the Massachusetts Institute of Technology, Rutgers State University of New Jersey, and the Research Foundation of State University of New York at Stony Brook under United States of America Serial No. 11/760379 and to the patents and/or patent applications resulting from it."
// Order Maintenance Tree (OMT)
//
// Maintains a collection of totally ordered values, where each value has an integer weight.
// The OMT is a mutable datatype.
//
// The Abstraction:
//
// An OMT is a vector of values, $V$, where $|V|$ is the length of the vector.
// The vector is numbered from $0$ to $|V|-1$.
// Each value has a weight. The weight of the $i$th element is denoted $w(V_i)$.
//
// We can create a new OMT, which is the empty vector.
//
// We can insert a new element $x$ into slot $i$, changing $V$ into $V'$ where
// $|V'|=1+|V|$ and
//
// V'_j = V_j if $j<i$
// x if $j=i$
// V_{j-1} if $j>i$.
//
// We can specify $i$ using a kind of function instead of as an integer.
// Let $b$ be a function mapping from values to nonzero integers, such that
// the signum of $b$ is monotically increasing.
// We can specify $i$ as the minimum integer such that $b(V_i)>0$.
//
// We look up a value using its index, or using a Heaviside function.
// For lookups, we allow $b$ to be zero for some values, and again the signum of $b$ must be monotonically increasing.
// When lookup up values, we can look up
// $V_i$ where $i$ is the minimum integer such that $b(V_i)=0$. (With a special return code if no such value exists.)
// (Rationale: Ordinarily we want $i$ to be unique. But for various reasons we want to allow multiple zeros, and we want the smallest $i$ in that case.)
// $V_i$ where $i$ is the minimum integer such that $b(V_i)>0$. (Or an indication that no such value exists.)
// $V_i$ where $i$ is the maximum integer such that $b(V_i)<0$. (Or an indication that no such value exists.)
//
// When looking up a value using a Heaviside function, we get the value and its index.
//
// We can also split an OMT into two OMTs, splitting the weight of the values evenly.
// Find a value $j$ such that the values to the left of $j$ have about the same total weight as the values to the right of $j$.
// The resulting two OMTs contain the values to the left of $j$ and the values to the right of $j$ respectively.
// All of the values from the original OMT go into one of the new OMTs.
// If the weights of the values don't split exactly evenly, then the implementation has the freedom to choose whether
// the new left OMT or the new right OMT is larger.
//
// Performance:
// Insertion and deletion should run with $O(\log |V|)$ time and $O(\log |V|)$ calls to the Heaviside function.
// The memory required is O(|V|).
//
// OMTs also support cursors. An OMTCURSOR is a mutable
// An OMTCURSOR is a mutable object that, at any moment in time, is
// either associated with a single OMT or is not associated with any
// OMT. Many different OMTCURSORs can be associated with a single OMT.
//
// We say that an OMTCURSOR is *invalid* if it is not currently
// associated with an OMT.
//
// Abstractly, an OMTCURSOR simply contains an integer offset of a
// particular OMTVALUE. We call this abstract integer the *offset*.
// Note, however, that the implementation may use a more
// complex representation in order to obtain higher performance.
// (Note: A first implementation might use the integer.)
//
// Given a valid OMTCURSOR, one
// * obtain the OMTVALUE at which the integer points in O(1) time,
// * increment or decrement the abstract integer (usually quickly.)
// The requirements are that the cursor is initialized to a
// randomly chosen valid integer, then the integer can be
// incremented in O(1) expected time.
// The OMTCURSOR may become invalidated under several conditions:
// * Incrementing or decrementing the abstract integer out of its
// valid range invalidates the OMTCURSOR.
// * If the OMT is modified, it may invalidate the cursor.
// * The user of the OMTCURSOR may explicitly invalidate the cursor.
// * The OMT is destroyed (in which case the OMTCURSOR is
// invalidated, but not destroyed.)
// Implementation Hints
//
// One way to implement the OMTCURSOR is with an integer. The problem
// is that obtaining the value at which the integer
// points takes O(\log n) time, which is not fast enough to meet the
// specification. However, this implementation is probably much
// faster than our current implementation because it is O(\log n)
// integer comparisons instead of O(\log n) key comparisons. This
// simple implementation may be the right thing for a first cut.
//
// To actually achieve the performance requirements, here's a better
// implementation: The OMTCURSOR contains a path from root to leaf.
// Fetching the current value is O(1) time since the leaf is
// immediately accessible. Modifying the path to find the next or
// previous item has O(1) expected time at a randomly chosen valid
// point
//
// The path can be implemented as an array. It probably makes sense
// for the array to by dynamically resized as needed. Since the
// array's size is O(log n), it is not necessary to ever shrink the
// array. Also, from the perspective of testing, it's probably best
// if the array is initialized to a short length (e.g., length 4) so
// that the doubling code is actually exercised.
//
// One way to implement invalidation is for each OMT to maintain a
// doubly linked list of OMTCURSORs. When destroying an OMT or
// changing the OMT's shape, one can simply step through the list
// invalidating all the OMTCURSORs.
//
// The list of OMTCURSORs should use the list.h abstraction. If it's
// not clear how to use it, Rich can explain it.
// Usage Hint: The OMTCURSOR is designed to be used inside the
// BRTcursor. A BRTcursor includes a pointer to an OMTCURSOR, which
// is created when the BRTcursor is created.
//
// The brt cursor implements its search by first finding a leaf node,
// containing an OMT. The BRT then passes its OMTCURSOR into the lookup
// method (i.e., one of toku_ebdomt_fetch, toku_omt_find_zero,
// toku_omt_find). The lookup method, if successful, sets the
// OMTCURSOR to refer to that element.
//
// As long as the OMTCURSOR remains valid, a BRTCURSOR next or prev
// operation can be implemented using next or prev on the OMTCURSOR.
//
// If the OMTCURSOR becomes invalidated, then the BRT must search
// again from the root of the tree. The only error that an OMTCURSOR
// next operation can raise is that it is invalid.
//
// If an element is inserted into the BRT, it may cause an OMTCURSOR
// to become invalid. This is especially true if the element will end
// up in the OMT associated with the cursor. A simple implementation
// is to invalidate all OMTCURSORS any time anything is inserted into
// into the BRT. Since the BRT already contains a list of BRT cursors
// associated with it, it is straightforward to go through that list
// and invalidate all the cursors.
//
// When the BRT closes a cursor, it destroys the OMTCURSOR.
// The programming API:
//typedef struct value *OMTVALUE; // A slight improvement over using void*.
typedef void *OMTVALUE;
typedef struct omt *OMT;
typedef struct omt_cursor *OMTCURSOR;
int toku_omt_create (OMT *omtp);
// Effect: Create an empty OMT. Stores it in *omtp.
// Requires: omtp != NULL
// Returns:
// 0 success
// ENOMEM out of memory (and doesn't modify *omtp)
// Performance: constant time.
int toku_omt_create_from_sorted_array(OMT *omtp, OMTVALUE *values, u_int32_t numvalues);
// Effect: Create a OMT containing values. The number of values is in numvalues.
// Stores the new OMT in *omtp.
// Requires: omtp != NULL
// Requires: values != NULL
// Requires: values is sorted
// Returns:
// 0 success
// ENOMEM out of memory (and doesn't modify *omtp)
// Performance: time=O(numvalues)
// Rational: Normally to insert N values takes O(N lg N) amortized time.
// If the N values are known in advance, are sorted, and
// the structure is empty, we can batch insert them much faster.
int toku_omt_create_steal_sorted_array(OMT *omtp, OMTVALUE **valuesp, u_int32_t numvalues, u_int32_t steal_capacity);
// Effect: Create an OMT containing values. The number of values is in numvalues.
// On success the OMT takes ownership of *valuesp array, and sets valuesp=NULL.
// Requires: omtp != NULL
// Requires: valuesp != NULL
// Requires: *valuesp is sorted
// Requires: *valuesp was allocated with toku_malloc
// Requires: Capacity of the *valuesp array is <= steal_capacity
// Requires: On success, *valuesp may not be accessed again by the caller.
// Returns:
// 0 success
// ENOMEM out of memory (and doesn't modify *omtp)
// EINVAL *valuesp == NULL or numvalues > capacity
// Performance: time=O(1)
// Rational: toku_omt_create_from_sorted_array takes O(numvalues) time.
// By taking ownership of the array, we save a malloc and memcpy,
// and possibly a free (if the caller is done with the array).
void toku_omt_destroy(OMT *omtp);
// Effect: Destroy an OMT, freeing all its memory.
// Does not free the OMTVALUEs stored in the OMT.
// Those values may be freed before or after calling toku_omt_destroy.
// Also sets *omtp=NULL.
// Requires: omtp != NULL
// Requires: *omtp != NULL
// Rationale: The usage is to do something like
// toku_omt_destroy(&s->omt);
// and now s->omt will have a NULL pointer instead of a dangling freed pointer.
// Rationale: Returns no values since free() cannot fail.
// Rationale: Does not free the OMTVALUEs to reduce complexity.
// Performance: time=O(toku_omt_size(*omtp))
u_int32_t toku_omt_size(OMT V);
// Effect: return |V|.
// Requires: V != NULL
// Performance: time=O(1)
int toku_omt_iterate_on_range(OMT omt, u_int32_t left, u_int32_t right, int (*f)(OMTVALUE, u_int32_t, void*), void*v);
// Effect: Iterate over the values of the omt, from left to right, calling f on each value.
// The second argument passed to f is the index of the value.
// The third argument passed to f is v.
// The indices run from 0 (inclusive) to toku_omt_size(omt) (exclusive).
// We will iterate only over [left,right)
//
// Requires: omt != NULL
// left <= right
// Requires: f != NULL
// Returns:
// If f ever returns nonzero, then the iteration stops, and the value returned by f is returned by toku_omt_iterate.
// If f always returns zero, then toku_omt_iterate returns 0.
// Requires: Don't modify omt while running. (E.g., f may not insert or delete values form omt.)
// Performance: time=O(i+\log N) where i is the number of times f is called, and N is the number of elements in omt.
// Rational: Although the functional iterator requires defining another function (as opposed to C++ style iterator), it is much easier to read.
int toku_omt_iterate(OMT omt, int (*f)(OMTVALUE, u_int32_t, void*), void*v);
// Effect: Iterate over the values of the omt, from left to right, calling f on each value.
// The second argument passed to f is the index of the value.
// The third argument passed to f is v.
// The indices run from 0 (inclusive) to toku_omt_size(omt) (exclusive).
// Requires: omt != NULL
// Requires: f != NULL
// Returns:
// If f ever returns nonzero, then the iteration stops, and the value returned by f is returned by toku_omt_iterate.
// If f always returns zero, then toku_omt_iterate returns 0.
// Requires: Don't modify omt while running. (E.g., f may not insert or delete values form omt.)
// Performance: time=O(i+\log N) where i is the number of times f is called, and N is the number of elements in omt.
// Rational: Although the functional iterator requires defining another function (as opposed to C++ style iterator), it is much easier to read.
int toku_omt_insert_at(OMT omt, OMTVALUE value, u_int32_t idx);
// Effect: Increases indexes of all items at slot >= index by 1.
// Insert value into the position at index.
//
// Returns:
// 0 success
// EINVAL if index>toku_omt_size(omt)
// ENOMEM
// On error, omt is unchanged.
// Performance: time=O(\log N) amortized time.
// Rationale: Some future implementation may be O(\log N) worst-case time, but O(\log N) amortized is good enough for now.
int toku_omt_set_at (OMT omt, OMTVALUE value, u_int32_t idx);
// Effect: Replaces the item at index with value.
// Returns:
// 0 success
// EINVAL if index>=toku_omt_size(omt)
// On error, omt i sunchanged.
// Performance: time=O(\log N)
// Rationale: The BRT needs to be able to replace a value with another copy of the same value (allocated in a different location)
int toku_omt_insert(OMT omt, OMTVALUE value, int(*h)(OMTVALUE, void*v), void *v, u_int32_t *idx);
// Effect: Insert value into the OMT.
// If there is some i such that $h(V_i, v)=0$ then returns DB_KEYEXIST.
// Otherwise, let i be the minimum value such that $h(V_i, v)>0$.
// If no such i exists, then let i be |V|
// Then this has the same effect as
// omt_insert_at(tree, value, i);
// If index!=NULL then i is stored in *index
// Requires: The signum of h must be monotonically increasing.
// Returns:
// 0 success
// DB_KEYEXIST the key is present (h was equal to zero for some value)
// ENOMEM
// On nonzero return, omt is unchanged.
// On nonzero non-DB_KEYEXIST return, *index is unchanged.
// Performance: time=O(\log N) amortized.
// Rationale: Some future implementation may be O(\log N) worst-case time, but O(\log N) amortized is good enough for now.
int toku_omt_delete_at(OMT omt, u_int32_t idx);
// Effect: Delete the item in slot index.
// Decreases indexes of all items at slot >= index by 1.
// Returns
// 0 success
// EINVAL if index>=toku_omt_size(omt)
// On error, omt is unchanged.
// Rationale: To delete an item, first find its index using toku_omt_find, then delete it.
// Performance: time=O(\log N) amortized.
void toku_omt_cursor_set_index(OMTCURSOR c, u_int32_t idx);
// Effect:
// Set the abstract index.
// Requires:
// The cursor is not invalid.
int toku_omt_fetch (OMT V, u_int32_t i, OMTVALUE *v, OMTCURSOR c);
// Effect: Set *v=V_i
// If c!=NULL then set c's abstract offset to i.
// Requires: v != NULL
// Returns
// 0 success
// EINVAL if index>=toku_omt_size(omt)
// On nonzero return, *v is unchanged, and c (if nonnull) is either
// invalidated or unchanged.
// Performance: time=O(\log N)
// Implementation Notes: It is possible that c was previously valid and was
// associated with a different OMT. If c is changed by this
// function, the function must remove c's association with the old
// OMT, and associate it with the new OMT.
int toku_omt_find_zero(OMT V, int (*h)(OMTVALUE, void*extra), void*extra, OMTVALUE *value, u_int32_t *idx, OMTCURSOR c);
// Effect: Find the smallest i such that h(V_i, extra)>=0
// If there is such an i and h(V_i,extra)==0 then set *index=i and return 0.
// If there is such an i and h(V_i,extra)>0 then set *index=i and return DB_NOTFOUND.
// If there is no such i then set *index=toku_omt_size(V) and return DB_NOTFOUND.
// Requires: index!=NULL
int toku_omt_find(OMT V, int (*h)(OMTVALUE, void*extra), void*extra, int direction, OMTVALUE *value, u_int32_t *idx, OMTCURSOR c);
// Effect:
// If direction >0 then find the smallest i such that h(V_i,extra)>0.
// If direction <0 then find the largest i such that h(V_i,extra)<0.
// (Direction may not be equal to zero.)
// If value!=NULL then store V_i in *value
// If index!=NULL then store i in *index.
// Requires: The signum of h is monotically increasing.
// Returns
// 0 success
// DB_NOTFOUND no such value is found.
// On nonzero return, *value and *index are unchanged, and c (if nonnull) is either invalidated
// or unchanged.
// Performance: time=O(\log N)
// Rationale:
// Here's how to use the find function to find various things
// Cases for find:
// find first value: ( h(v)=+1, direction=+1 )
// find last value ( h(v)=-1, direction=-1 )
// find first X ( h(v)=(v< x) ? -1 : 1 direction=+1 )
// find last X ( h(v)=(v<=x) ? -1 : 1 direction=-1 )
// find X or successor to X ( same as find first X. )
//
// Rationale: To help understand heaviside functions and behavor of find:
// There are 7 kinds of heaviside functions.
// The signus of the h must be monotonically increasing.
// Given a function of the following form, A is the element
// returned for direction>0, B is the element returned
// for direction<0, C is the element returned for
// direction==0 (see find_zero) (with a return of 0), and D is the element
// returned for direction==0 (see find_zero) with a return of DB_NOTFOUND.
// If any of A, B, or C are not found, then asking for the
// associated direction will return DB_NOTFOUND.
// See find_zero for more information.
//
// Let the following represent the signus of the heaviside function.
//
// -...-
// A
// D
//
// +...+
// B
// D
//
// 0...0
// C
//
// -...-0...0
// AC
//
// 0...0+...+
// C B
//
// -...-+...+
// AB
// D
//
// -...-0...0+...+
// AC B
int toku_omt_split_at(OMT omt, OMT *newomt, u_int32_t idx);
// Effect: Create a new OMT, storing it in *newomt.
// The values to the right of index (starting at index) are moved to *newomt.
// Requires: omt != NULL
// Requires: newomt != NULL
// Returns
// 0 success,
// EINVAL if index > toku_omt_size(omt)
// ENOMEM
// On nonzero return, omt and *newomt are unmodified.
// Performance: time=O(n)
// Rationale: We don't need a split-evenly operation. We need to split items so that their total sizes
// are even, and other similar splitting criteria. It's easy to split evenly by calling toku_omt_size(), and dividing by two.
int toku_omt_merge(OMT leftomt, OMT rightomt, OMT *newomt);
// Effect: Appends leftomt and rightomt to produce a new omt.
// Sets *newomt to the new omt.
// On success, leftomt and rightomt destroyed,.
// Returns 0 on success
// ENOMEM on out of memory.
// On error, nothing is modified.
// Performance: time=O(n) is acceptable, but one can imagine implementations that are O(\log n) worst-case.
void toku_omt_clear(OMT omt);
// Effect: Set the tree to be empty.
// Note: Will not reallocate or resize any memory, since returning void precludes calling malloc.
// Performance: time=O(1)
unsigned long toku_omt_memory_size (OMT omt);
// Effect: Return the size (in bytes) of the omt, as it resides in main memory. Don't include any of the OMTVALUES.
int toku_omt_cursor_create (OMTCURSOR *p);
// Effect: Create an OMTCURSOR. Stores it in *p. The OMTCURSOR is
// initially invalid.
// Requires: p != NULL
// Returns:
// 0 success
// ENOMEM out of memory (and doesn't modify *omtp)
// Performance: constant time.
void toku_omt_cursor_destroy (OMTCURSOR *p);
// Effect: Invalidates *p (if it is valid) and frees any memory
// associated with *p.
// Also sets *p=NULL.
// Rationale: The usage is to do something like
// toku_omt_cursor_destroy(&c);
// and now c will have a NULL pointer instead of a dangling freed pointer.
// Rationale: Returns no values since free() cannot fail.
int toku_omt_cursor_is_valid (OMTCURSOR c);
// Effect: returns 0 iff c is invalid.
// Performance: time=O(1)
int toku_omt_cursor_next (OMTCURSOR c, OMTVALUE *v);
// Effect: Increment c's offset, and find and store the value in v.
// Requires: v != NULL
// Returns
// 0 success
// EINVAL if the offset goes out of range or c is invalid.
// On nonzero return, *v is unchanged and c is invalidated.
// Performance: time=O(log N) worst case, expected time=O(1) for a randomly
// chosen initial position.
int toku_omt_cursor_current_index(OMTCURSOR c, u_int32_t *idx);
// Effect: Stores c's offset in *index.
// Requires: index != NULL
// Returns
// 0 success
// EINVAL c is invalid
// On nonzero return, *index is unchanged and c is unchanged.
// Performance: time=O(1)
OMT toku_omt_cursor_get_omt(OMTCURSOR c);
// Effect: returns the associated omt or NULL if not associated.
// Performance: time=O(1)
int toku_omt_cursor_current (OMTCURSOR c, OMTVALUE *v);
// Effect: Store in v the value pointed by c's abstract offset
// Requires: v != NULL
// Returns
// 0 success
// EINVAL if c is invalid
// On non-zero return, *v is unchanged
// Performance: O(1) time
int toku_omt_cursor_prev (OMTCURSOR c, OMTVALUE *v);
// Effect: Decrement c's offset, and find and store the value in v.
// Requires: v != NULL
// Returns
// 0 success
// EINVAL if the offset goes out of range or c is invalid.
// On nonzero return, *v is unchanged and c is invalidated.
// Performance: time=O(log N) worst case, expected time=O(1) for a randomly
// chosen initial position.
void toku_omt_cursor_invalidate (OMTCURSOR c);
// Effect: Invalidate c. (This does not mean that c is destroyed or
// that its memory is freed.)
// If c is valid, the invalidate callback function (if any) will be called
// before invalidating c.
void toku_omt_cursor_set_invalidate_callback(OMTCURSOR c, void (*f)(OMTCURSOR,void*), void* extra);
// Effect:
// Saves function 'f' to be called whenever the cursor is invalidated.
// 'extra' is passed as an additional parameter to f.
// Requires:
// The lifetime of the 'extra' parameter must continue at least till the cursor
// is destroyed.
void toku_omt_cursor_associate(OMT omt, OMTCURSOR c);
// Effect:
// Associates an omtcursor with an omt.
// Requires:
// The omtcursor is not associated with any other omt.
// Requires:
// toku_omt_associate must be called when the omt-lock is held
// Rationale:
// This is used by brt_cursors for omts representing leaf nodes.
// These omts are touched by multiple threads, and therefore require locks
// for modifying the list of omtcursors.
// We do not want to grab two locks (one for omt, and one for the old
// associated omt).
#endif /* #ifndef OMT_H */