mirror of
https://github.com/MariaDB/server.git
synced 2025-01-22 23:04:20 +01:00
95 lines
4.2 KiB
Text
95 lines
4.2 KiB
Text
|
@node Introduction to TokuDB
|
||
|
@chapter Introduction to TokuDB
|
||
|
|
||
|
TokuDB is an embedded transactional database that provides very fast
|
||
|
insertions and range queries on dictionaries.
|
||
|
|
||
|
TokUDB provides transactions with full ACID properties.
|
||
|
|
||
|
TokuDB is a library that can be linked directly into applications,
|
||
|
making it an @dfn{embedded} database.
|
||
|
|
||
|
TokuDB provides an API that is very similar to the Berkeley@tie{}DB.
|
||
|
|
||
|
@node Dictionaries and Associative Memories
|
||
|
@section Dictionaries and Associative Memories
|
||
|
|
||
|
Here we describe what we mean by ``dictionary''.
|
||
|
|
||
|
A @dfn{dictionary} is an ordered set of key-data pairs.
|
||
|
@itemize @bullet
|
||
|
@item
|
||
|
An @dfn{insertion} stores a key-data pair into a dictionary.
|
||
|
(One of the API calls that inserts pairs is called @code{DB->put}.)
|
||
|
|
||
|
@item
|
||
|
A @dfn{point query} finds the data associated with a particular key.
|
||
|
(One of the API calls that executes a point query is called @code{DB->get}.)
|
||
|
|
||
|
@item
|
||
|
A @dfn{range query} iterates through all the key-value pairs in a
|
||
|
range. For example, in a phone book, you could step through all the
|
||
|
people whose last names are between @samp{Jones} and @samp{Smith}. In
|
||
|
the API, range queries are supported using cursors.
|
||
|
@end itemize
|
||
|
|
||
|
An @dfn{associative memory} is like a dictionary but it supports only
|
||
|
insertions and point queries, but not range queries. Often hash
|
||
|
tables are used to implement associative memories. (In some
|
||
|
languages, such as Perl, an ``dictionary'' is really an associative
|
||
|
memory.)
|
||
|
|
||
|
@node Asymptotic Performance of Fractal Trees
|
||
|
@section Asymptotic Performance of Fractal Trees
|
||
|
|
||
|
For in-memory dictionaries, trees are often the data structure of
|
||
|
choice. For example, red-black trees or AVL trees implement all of
|
||
|
the above operations in @math{O(\log N)} time for in-memory data
|
||
|
structures, where $math{N} is the number of entries in the dictionary
|
||
|
(assuming constant-size key-data pairs).
|
||
|
|
||
|
For disk-resident dictionaries, most systems (such as
|
||
|
Berkeley@tie{}DB) employ B-trees, which can implement insertions and
|
||
|
point queries with @math{O(\log_B N)} disk operations in the worst
|
||
|
case, where @math{B} is the block size of the B-tree (assuming
|
||
|
constant-size key-data pairs).
|
||
|
|
||
|
The @dfn{effective block size} of a disk is the size at which the
|
||
|
disk-head movement does not dominate the cost of storing or retrieving
|
||
|
data off the disk. The effective block size is difficult to calculate
|
||
|
for actual disk systems, since it depends on how far the head must
|
||
|
move (the disk head moves short distances faster than long distances,
|
||
|
reducing the effective block size), whether the data is near the
|
||
|
center of the disk (where the bandwidth is small and hence the
|
||
|
effective block size tends to be small) or at the outer edge of the
|
||
|
disk (where the bandwidth is larger), and on how much prefetching the
|
||
|
operating system and disk firmware perform.
|
||
|
|
||
|
TokuDB, on the other hand, employs @dfn{fractal trees}, a new class of
|
||
|
data structures that provide the same API as B-trees, but are much
|
||
|
faster. Fractal tree point-queries have the same asymptotic cost as
|
||
|
B-trees (@math{O(\log_B N)}), but insertions require only
|
||
|
@math{O((\log N)/B)} disk operations on average, where @math{B} is the
|
||
|
effective block size of the disk system.
|
||
|
|
||
|
How much faster is that in concrete terms? In 2007, the effective
|
||
|
block size of a disk is about a megabyte. If you use 100-byte
|
||
|
key-data pairs, then you can fit about 10,000 pairs into an effective
|
||
|
block. If you have a 100 Terabyte database then @math{N=2^{40}}, so
|
||
|
on average a fractal tree requires about @math{(\log N)/B = 40/10000 =
|
||
|
1/250} of a disk transfer per insertion. That is, on average, you can
|
||
|
perform 250 insertions per disk I/O. In contrast, a traditional
|
||
|
B-tree equires @math{\log_B N = 2} disk I/O's per insertion. In
|
||
|
practice, because of caching in main memory, traditional B-trees
|
||
|
require 1 disk read and 1 disk write per insertion. Thus TokuDB's
|
||
|
insertions are orders of magnitude faster than dictionaries based on
|
||
|
traditional B-trees.
|
||
|
|
||
|
Range queries are also faster for TokuDB than in a typical B-tree.
|
||
|
Since B-trees usually use blocks sizes that relatively small compared
|
||
|
to the effective block size of the disk they require many disk-head
|
||
|
movements to traverse a range. Fractal trees use block sizes that are
|
||
|
the effective block size of the disk.
|
||
|
|
||
|
|