Overview
DBM (Database Manager) is a concept for libraries which store an associative array on persistent storage. In other words, DBM allows an application program to store key-value pairs in a file and reuse them later. Each key and value is a string or a sequence of bytes. The key of each record must be unique within the database, and a value is associated with it. You can retrieve a stored record by its key very quickly. Thanks to the simple structure of DBM, its performance can be extremely high.
Tkrzw is a C++ library implementing DBM with various algorithms. It features high degrees of performance, concurrency, scalability and durability. The following classes are provided.
- HashDBM : File database manager implementation based on hash table.
- TreeDBM : File database manager implementation based on B+ tree.
- SkipDBM : File database manager implementation based on skip list.
- TinyDBM : On-memory database manager implementation based on hash table.
- BabyDBM : On-memory database manager implementation based on B+ tree.
- CacheDBM : On-memory database manager implementation with LRU deletion.
- Std(Hash|Tree)DBM : On-memory DBM implementations using std::unordered_map and std::map.
- (Poly|Shard)DBM : Polymorphic and sharding database manager adapters.
- AsyncDBM : Asynchronous database manager adapter.
- (File|Mem)Index : Secondary index implementations.
All database classes share the same interface so that applications can use any of them with the common API. All classes are thread-safe so that multiple threads can access the same database simultaneously. Basically, the database file is opened with the “Open” method and closed with the “Close” method. You can store records with the “Set” method, retrieve records with the “Get” method, and remove records with the “Remove” method. An iterator is also supported to retrieve each and every record of the database.
Transactional features are also supported. If you want to evaluate a record and update it atomically, you use the “Process” method, which takes a key and an arbitrary callback function to process the record. If you want to evaluate multiple records and update them atomically, you use the “ProcessMulti” method, which takes keys and arbitrary callback functions to process the records. If you want to do long transactions without locking records, the “CompareExchange” method and the “CompareExchangeMulti” method are useful.
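For illustration, the following is a minimal sketch of an optimistic update with “CompareExchange”; the key and values are arbitrary examples, and it assumes the mismatch case is reported as INFEASIBLE_ERROR.

#include <iostream>
#include "tkrzw_dbm_hash.h"

using namespace tkrzw;

int main() {
  HashDBM dbm;
  dbm.Open("casket.tkh", true);
  dbm.Set("job:1", "pending");
  // Atomically replaces the value only if it is still "pending".  If another
  // thread has changed it first, the record is left untouched.
  const Status status = dbm.CompareExchange("job:1", "pending", "done");
  if (status == Status::INFEASIBLE_ERROR) {
    std::cerr << "someone else updated the record first" << std::endl;
  }
  dbm.Close();
  return 0;
}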
Each data class has different approaches for durability. The file database classes support the auto restore feature to recover records after the process crashes. The file hash/tree databases support the appending update mode, which prevents data corruption of existing records. The file skip database and all on-memory databases atomically replace the entire database with a completed temporary file, which realizes the strongest durability on a single machine. All database classes support producing online update logs, which enables online data replication and staged backup.
If the data structure is more complex than key-value pairs, you can treat the value as a serialized text/binary of any data structure. You can use any serialization format. To look up records by some properties except for the primary key, you can use secondary indices. MemIndex and FileIndex classes are useful for the purpose.
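As a sketch of the idea (using a plain ordered database rather than the actual MemIndex/FileIndex API; the key layout, record values, and the already-opened databases “dbm” and “index_dbm” are assumptions), a secondary index can map a property plus the primary key to an empty value, and forward matching retrieves all primary keys with that property:

// Primary records in a HashDBM: user ID -> serialized profile (assumed layout).
dbm.Set("1001", "name=alice&division=sales");
dbm.Set("1002", "name=bob&division=sales");
// Secondary index entries in a TreeDBM: "<division>\0<user ID>" -> empty value.
index_dbm.Set(std::string("sales") + '\0' + "1001", "");
index_dbm.Set(std::string("sales") + '\0' + "1002", "");
// Finds all user IDs of the "sales" division by forward matching.
const std::string prefix = std::string("sales") + '\0';
std::unique_ptr<DBM::Iterator> iter = index_dbm.MakeIterator();
iter->Jump(prefix);
std::string key, value;
while (iter->Get(&key, &value) == Status::SUCCESS) {
  if (!StrBeginsWith(key, prefix)) break;
  const std::string user_id = key.substr(prefix.size());
  // ... look up user_id in the primary database ...
  iter->Next();
}

The bundled FileIndex and MemIndex classes encapsulate this kind of pattern so that the application doesn't have to manage composite keys itself.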
Tkrzw adopts dependency injection for file implementations. Bundled implementations support memory mapping I/O, normal read/write I/O, and direct block I/O. You can inject your own implementations to support specific data storages and file systems. Whereas such file implementations use APIs (system calls) of the underlying operating systems, the upper layer for database algorithms doesn’t depend on them so that the maximum portability is achieved. Tkrzw supports any Unix-like systems and Windows.
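For instance, a minimal sketch of injecting one of the bundled file implementations, assuming the database constructor that accepts a file object and the bundled positional I/O class:

#include <memory>
#include "tkrzw_dbm_hash.h"
#include "tkrzw_file_pos.h"

using namespace tkrzw;

int main() {
  // Uses normal positional read/write I/O instead of the default memory
  // mapping I/O.  Any subclass of tkrzw::File could be injected here.
  HashDBM dbm(std::make_unique<PositionalParallelFile>());
  dbm.Open("casket.tkh", true);
  dbm.Set("hello", "world");
  dbm.Close();
  return 0;
}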
Bridging interfaces are provided to use Tkrzw in programming languages other than C++. Currently, C, Java, Python, Ruby, and Go interfaces are provided. The C interface is bundled in the main package. The other interfaces are packaged separately.
Command line utilities are provided as well. With them, you can create databases, set records, retrieve records, remove records, rebuild databases, and restore databases. A performance testing tool is also bundled.
If you want to share a database among multiple processes or programs on remote machines, use the database service of Tkrzw-RPC.
Download
You can download source packages in the following directories.
- C++ source packages
- Java source packages
- Python source packages
- Ruby source packages
- Go source packages
Documents
The following are API documents for each language.
Installation
Tkrzw is implemented based on the C++17 standard and POSIX/Windows API. On a Unix-like system (Linux, FreeBSD, and Mac OS X), GCC (version 7.3 or later) is required. On a Windows system, Visual C++ (2019 or later) is required.
For UNIX
To build and install the library, usually, you will run the following commands.
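The exact options depend on the environment, but a typical sequence is:

./configure
make
make check
sudo make install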
“make check” is just for testing and it can take several minutes. You can omit it. If you build binary packages on resource-limited systems, use “make check-light” instead. If you want to install the files under “/usr”, specify “--prefix=/usr” with the configure script.
By default, the library and related files are installed under “/usr/local”.
Let’s check whether the command utilities work properly.
Read the C++ API documents for usage of the library. Then, let’s build a sample program. Make a file of the following C++ code and save it as “helloworld.cc”.
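A minimal program in that spirit, using the file hash database, might look like the following; the file name and record values are arbitrary examples.

#include <iostream>
#include "tkrzw_dbm_hash.h"

using namespace tkrzw;

int main(int argc, char** argv) {
  // Opens a new database file, creating it if it doesn't exist.
  HashDBM dbm;
  dbm.Open("casket.tkh", true);

  // Stores records.
  dbm.Set("hello", "world");
  dbm.Set("foo", "bar");

  // Retrieves a record value, with a default for a missing key.
  std::cout << dbm.GetSimple("hello", "*") << std::endl;

  // Closes the database.
  dbm.Close();
  return 0;
}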
To build an application program, you’ll typically run a command like this. The compiler flag “-std=c++17” is necessary if the default C++ version of your compiler is older than C++17. The flags “-llzma”, “-llz4”, “-lzstd”, and “-lz” should be set only if corresponding features (LZMA, LZ4, ZStd, and ZLib) are enabled by the configure script. If you don’t use threading functions in the application code, you can omit the compiler flag “-pthread”. Even so, the linking flags “-latomic” and “-lpthread” are necessary because the library uses threading functions.
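Assuming the default installation prefix “/usr/local” and all compression features enabled, the command might look like this:

g++ -std=c++17 -I/usr/local/include helloworld.cc -o helloworld \
  -L/usr/local/lib -ltkrzw -llzma -llz4 -lzstd -lz -latomic -lpthread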
The bundled command “tkrzw_build_util” is useful to know necessary CPPFLAGS (flags for the preprocessor), and LDFLAGS (flags for the linker).
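For example (the exact output depends on how the library was configured and installed):

> tkrzw_build_util config -i
-I/usr/local/include
> tkrzw_build_util config -l
-L/usr/local/lib -ltkrzw -llzma -llz4 -lzstd -lz -latomic -lpthread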
In some environments, you can also use the “pkg-config” command to list up those flags.
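For example, assuming the package configuration file is installed as “tkrzw.pc” where pkg-config can find it:

> pkg-config --cflags --libs tkrzw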
To run the test suite, GoogleTest is required. Although testing the library is not necessary for most users, you can do so just in case.
The “--enable-opt-native” flag of the configure script optimizes the operation code specifically for the native platform. If it is omitted, normal optimization is done. The “--enable-most-features” flag enables as many features as possible on the platform. Data compression features using external libraries are integrated. You can also enable each of them separately by setting “--enable-zlib”, “--enable-zstd”, “--enable-lz4”, and “--enable-lzma”. If you use a prefix other than “/usr/local” for extra packages, specify the prefix with “--with-extra”, such as “--with-extra=/opt/homebrew”. If you don’t need the dynamic linking library, set “--disable-shared”. If you want the utility commands with static linking, set “--enable-static”. If you develop Tkrzw itself, set “--enable-devel” or “--enable-debug”.
For Windows
To build and install the library, you use “nmake”. Edit VCMakefile to make sure that the base path of Visual C++ and the base path of the system SDK are correct. The directory names differ among minor versions, so checking them is important.
Open “x64 Native Tools Command Prompt for VS 2019” in the control panel. It sets command paths and other necessary environment configurations. Set the current directory to Tkrzw’s source directory. Run the following commands.
> nmake -f VCMakefile
> nmake -f VCMakefile check
To install the library, open another command prompt of VC++ as Administrator and run the following command.
> nmake -f VCMakefile install
By default, the library and related files are installed under “C:\Program Files\tkrzw”. You can change it by editing VCMakefile before installation.
Read the C++ API documents for usage of the library. Then, let’s build a sample program. Use the same “helloworld.cc” source file as in the UNIX section above.
To build an application program, you’ll typically run a command like this. The compiler flag “/std:c++17” is necessary if the default C++ version of your compiler is older than C++17.
> cl /std:c++17 /Zc:__cplusplus ^
  /I "C:\Program Files (x86)\tkrzw\include" ^
  /O2 /EHsc /W3 /MT helloworld.cc tkrzw.lib /OUT:helloworld.exe ^
  /link /libpath:"C:\Program Files (x86)\tkrzw\lib"
The bundled command “tkrzw_build_util” is useful to know necessary CPPFLAGS (flags for the preprocessor), and LDFLAGS (flags for the linker).
> tkrzw_build_util config -i
/I "C:\Program Files\tkrzw\include"
> tkrzw_build_util config -l
/libpath:"C:\Program Files\tkrzw\lib" tkrzw.lib
By default, the compiler flag “/MT” is specified to link to the multithreaded, static version of the run-time library. You can change it to “/MD” to link to the dynamic run-time library. You can also build Tkrzw as a DLL with the compiler option “/LD”.
The data compression libraries are not integrated on Windows by default. If you want to enable some of them, define macros by adding some of “/D_TKRZW_COMP_ZLIB”, “/D_TKRZW_COMP_ZSTD”, “/D_TKRZW_COMP_LZ4”, and “/D_TKRZW_COMP_LZMA” to CLFLAGS in VCMakefile. Then, add the library (lib/dll) files to EXTRALIBRARIES in VCMakefile. You’ll also need to add the library files to the command line when building your own programs.
Performance
Each database class has specific traits. It is important to select the best one or combine some of them for the use case.
class | medium | algorithm | ordered access | time complexity | update method | threading mutex | typical footprint |
---|---|---|---|---|---|---|---|
HashDBM | File | Hash table | No | O(1) | Online | Record RW-lock | 14 bytes/rec |
TreeDBM | File | B+ tree | Yes | O(log N) | Online | Page RW-lock | 3 bytes/rec |
SkipDBM | File | Skip list | Yes | O(log N) | Batch | Whole RW-lock | 4 bytes/rec |
TinyDBM | RAM | Hash table | No | O(1) | Online | Record RW-lock | – |
BabyDBM | RAM | B+ tree | Yes | O(log N) | Online | Page RW-lock | – |
CacheDBM | RAM | Hash table | No | O(1) | Online | Slot RW-lock | – |
StdHashDBM | RAM | Hash table | No | O(1) | Online | Whole RW-lock | – |
StdTreeDBM | RAM | Red-black tree | Yes | O(log N) | Online | Whole RW-lock | – |
Performance also differs among the database classes. Let’s say one thread sets 1 million records. The key of each record is an 8-byte string from “00000000” through “00999999” in ascending order. The value is an 8-byte string. Then, the thread retrieves all records and finally removes them. The following table shows the throughput of all operations on a computer with a 3.5GHz 6-core CPU.
class | Set | Get | Remove | memory usage | file size |
---|---|---|---|---|---|
HashDBM | 2,194,855 QPS | 2,544,004 QPS | 1,918,255 QPS | 28.9 MB | 26.8 MB |
TreeDBM | 1,799,118 QPS | 1,897,519 QPS | 1,654,547 QPS | 58.5 MB | 17.8 MB |
SkipDBM | 1,252,369 QPS | 448,360 QPS | 1,299,623 QPS | 78.3 MB | 19.3 MB |
TinyDBM | 3,000,138 QPS | 3,718,786 QPS | 3,473,247 QPS | 52.8 MB | n/a |
BabyDBM | 2,580,311 QPS | 2,734,840 QPS | 2,362,725 QPS | 40.9 MB | n/a |
CacheDBM | 2,940,138 QPS | 3,880,209 QPS | 3,633,801 QPS | 71.3 MB | n/a |
StdHashDBM | 1,799,697 QPS | 2,987,751 QPS | 2,876,639 QPS | 99.9 MB | n/a |
StdTreeDBM | 1,489,055 QPS | 3,716,134 QPS | 3,489,438 QPS | 107.0 MB | n/a |
Next, we use 10 threads, each of which sets 100 thousand records. Thus, 1 million records in total are set. Then, the threads retrieve all records and finally remove them. As seen below, the on-memory hash database is extremely good at concurrent updating. The file hash database is also very good at concurrent updating although it is based on the file storage.
class | Set | Get | Remove | memory usage | file size |
---|---|---|---|---|---|
HashDBM | 2,235,611 QPS | 4,805,936 QPS | 2,690,573 QPS | 29.3 MB | 26.8 MB |
TreeDBM | 1,802,679 QPS | 4,085,953 QPS | 1,805,765 QPS | 79.4 MB | 19.0 MB |
SkipDBM | 914,857 QPS | 1,897,854 QPS | 1,000,715 QPS | 86.8 MB | 19.3 MB |
TinyDBM | 6,649,593 QPS | 10,278,645 QPS | 8,986,748 QPS | 55.5 MB | n/a |
BabyDBM | 3,195,461 QPS | 11,635,750 QPS | 3,436,732 QPS | 47.4 MB | n/a |
CacheDBM | 8,467,970 QPS | 12,087,366 QPS | 11,949,141 QPS | 72.0 MB | n/a |
StdHashDBM | 1,167,830 QPS | 15,775,355 QPS | 1,784,857 QPS | 100.0 MB | n/a |
StdTreeDBM | 1,124,624 QPS | 16,023,594 QPS | 2,130,565 QPS | 107.2 MB | n/a |
The above test cases should show ideal performance of each database class because records are accessed sequentially and the entire data can be cached on memory by the file system. Then, let’s move on to tougher test cases. We build a large database of 100 million records with random keys between “00000000” and “99999999”. As some keys are duplicated and such records are overwritten, the actual number of unique records is about 63 million.
class | Set | Get | Remove | memory usage | file size |
---|---|---|---|---|---|
HashDBM | 1,312,484 QPS | 2,052,518 QPS | 1,759,782 QPS | 1788.1 MB | 1828.2 MB |
TreeDBM | 191,631 QPS | 148,509 QPS | 389,659 QPS | 4012.0 MB | 1714.3 MB |
SkipDBM | 454,955 QPS | 213,812 QPS | 453,710 QPS | 3036.7 MB | 1225.7 MB |
TinyDBM | 2,534,238 QPS | 2,744,532 QPS | 3,083,089 QPS | 3573.5 MB | n/a |
BabyDBM | 518,104 QPS | 517,197 QPS | 514,992 QPS | 2605.4 MB | n/a |
CacheDBM | 2,351,585 QPS | 2,375,056 QPS | 3,051,809 QPS | 4753.8 MB | n/a |
StdHashDBM | 1,912,922 QPS | 2,443,481 QPS | 2,850,411 QPS | 6410.1 MB | n/a |
StdTreeDBM | 482,476 QPS | 530,598 QPS | 522,506 QPS | 6596.3 MB | n/a |
Finally, we do the same operations with 10 threads. These test cases aim to simulate typical use cases where a large amount of random records are inserted and retrieved simultaneously. As suggested by the time complexity O(1) in theory, actual throughputs of the file hash database and the on-memory hash database are not affected by access patterns or the number of records. Meanwhile, throughputs of the file tree database and the file skip database decline although they can still respond to hundreds of thousands of queries per second.
class | Set | Get | Remove | memory usage | file size |
---|---|---|---|---|---|
HashDBM | 2,558,225 QPS | 6,041,697 QPS | 4,122,059 QPS | 1788.3 MB | 1828.2 MB |
TreeDBM | 343,564 QPS | 562,983 QPS | 568,628 QPS | 5195.4 MB | 1722.2 MB |
SkipDBM | 393,086 QPS | 1,384,863 QPS | 418,209 QPS | 3036.8 MB | 1225.7 MB |
TinyDBM | 8,351,605 QPS | 9,880,235 QPS | 9,272,530 QPS | 3573.4 MB | n/a |
BabyDBM | 1,239,584 QPS | 3,754,941 QPS | 1,187,900 QPS | 2645.8 MB | n/a |
CacheDBM | 9,314,902 QPS | 9,711,285 QPS | 11,446,497 QPS | 4754.0 MB | n/a |
StdHashDBM | 1,279,515 QPS | 13,572,641 QPS | 1,794,882 QPS | 6410.1 MB | n/a |
StdTreeDBM | 374,263 QPS | 4,118,937 QPS | 407,559 QPS | 6596.1 MB | n/a |
In general, if you want a key-value storage with the highest performance, choosing the file hash database is recommended. If you need ordered access of records, choosing the file tree database is recommended. If you need scalability of ordered databases, choosing the file skip database is recommended. If you need extreme performance, the on-memory hash database and the on-memory tree database are useful although they require another way to save/load records. If you want a cache system with deletion of LRU records, the cache database is useful.
HashDBM: The File Hash Database
The file hash database stores the key-value structure in a single file. It uses a hash table and linked lists of records attached to the buckets. Therefore, given the number of records N and the number of buckets M, the average time complexity of data retrieval is O(N/M). If M is large enough, the time complexity can be regarded as O(1).
Thread concurrency is pursued in this implementation. Reader-writer locking is applied only to each hash bucket. Therefore, even writer threads, which can set or remove records, perform in parallel. Blocking happens only when multiple writers try to update the same record simultaneously. Reader threads which retrieve records don’t block each other, even for the same record.
Durability is also pursued. Updating operations support two modes: in-place and appending. In the in-place mode, the region of an existing record in the file is rewritten to modify the value of the record. In the appending mode, the region in the file for an existing record is never rewritten. Instead, new data to overwrite the value is appended at the end of the file. The new data is put at the top of the linked list for the record so that it is detected prior to the old one. In both modes, even if the hash table is broken for some reason, the database is restored automatically to salvage record data. In the in-place mode, the record being updated when the breakage happens can be lost, but the other records can be recovered. In the appending mode, no record is ever lost unless the file system itself breaks.
The hash table is static and not reorganized implicitly. Thus, if the load factor of the hash table is high, the database should be rebuilt explicitly by the application. Rebuilding the database is effective to resolve fragmentation which can happen in both the in-place mode and the appending mode. Rebuilding the database doesn’t block other database operations. In other words, you can retrieve and update records while the database is being rebuilt.
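The following sketch shows how those choices could look in application code; it assumes the OpenAdvanced method and the TuningParameters structure of HashDBM, and the parameter values are arbitrary examples.

#include "tkrzw_dbm_hash.h"

using namespace tkrzw;

int main() {
  // Opens a database in the appending update mode with an explicit number
  // of buckets.
  HashDBM dbm;
  HashDBM::TuningParameters params;
  params.update_mode = HashDBM::UPDATE_APPENDING;
  params.num_buckets = 1000000;
  dbm.OpenAdvanced("casket.tkh", true, File::OPEN_DEFAULT, params);

  // ... many updates later ...

  // Rebuilds the database to resolve fragmentation.  Other threads can keep
  // reading and updating records while the rebuild runs.
  bool tobe = false;
  dbm.ShouldBeRebuilt(&tobe);
  if (tobe) {
    dbm.Rebuild();
  }
  dbm.Close();
  return 0;
}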
See the tips of Tuning HashDBM too. Tuning has a significant influence on performance. For details of the API, see the API document.
Example Code
This is a code example where basic operations are done without checking errors.
std::unique_ptr<DBM::Iterator> iter = dbm.MakeIterator();
iter->First();
std::string key, value;
while (iter->Get(&key, &value) == Status::SUCCESS) {
  std::cout << key << ":" << value << std::endl;
  iter->Next();
}
// Closes the database.
dbm.Close();
return 0;
}
This is a code example which represents a more serious use case with performance tuning and thorough error checks.
std::unique_ptr<DBM::Iterator> iter = dbm.MakeIterator();
if (iter->First() != Status::SUCCESS) {
  // Failure of the First operation is critical so we stop.
  Die("First failed: ", status);
}
while (true) {
  // Retrieves the current record data.
  std::string iter_key, iter_value;
  status = iter->Get(&iter_key, &iter_value);
  if (status == Status::SUCCESS) {
    std::cout << iter_key << ":" << iter_value << std::endl;
  } else {
    // This could happen if another thread removed the current record.
    if (status != Status::NOT_FOUND_ERROR) {
      // Error types other than NOT_FOUND_ERROR are critical.
      Die("Iterator::Get failed: ", status);
    }
    // NOT_FOUND_ERROR also means that there are no more records.
    std::cerr << "no more records to read" << std::endl;
    break;
  }
  // Moves the iterator to the next record.
  iter->Next();
}
The file hash database, as well as the other database classes, supports callback functions to process the value of a record. The operation is done while the record is locked, so the update appears atomic.
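As an illustration, a sketch of such a callback is shown below; it assumes the ProcessFull/ProcessEmpty callbacks of DBM::RecordProcessor, an already-opened database “dbm”, and an arbitrary key.

// A processor which appends ",visited" to an existing value, or creates the
// record with "visited" if it doesn't exist.  Tkrzw invokes it while the
// record is locked, so the read-modify-write is atomic.
class VisitProcessor final : public DBM::RecordProcessor {
 public:
  std::string_view ProcessFull(std::string_view key, std::string_view value) override {
    new_value_ = std::string(value) + ",visited";
    return new_value_;
  }
  std::string_view ProcessEmpty(std::string_view key) override {
    return "visited";
  }
 private:
  std::string new_value_;
};

VisitProcessor proc;
dbm.Process("page:/index.html", &proc, true);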
Format
The hash database file is composed of five sections: the metadata section, the bucket section, the free block pool section, the record header section, and the record section.
The metadata section dominates the first 128 bytes of the file. It contains the following fields.
- Magic data
- From the offset 0. A 9-byte string “TkrzwHDB\n”.
- The front cyclic magic data.
- From the offset 9. A 1-byte integer.
- The package major version
- From the offset 10. A 1-byte integer.
- The package minor version
- From the offset 11. A 1-byte integer.
- The static flags
- From the offset 12. A 1-byte integer.
- The offset width
- From the offset 13. A 1-byte integer.
- The alignment power
- From the offset 14. A 1-byte integer.
- The closure flags
- From the offset 15. A 1-byte integer.
- The number of buckets
- From the offset 16. An 8-byte big-endian integer.
- The number of records
- From the offset 24. An 8-byte big-endian integer.
- The effective data size.
- From the offset 32. An 8-byte big-endian integer.
- The file size
- From the offset 40. An 8-byte big-endian integer.
- The last modified timestamp.
- From the offset 48. An 8-byte big-endian integer.
- The database type
- From the offset 56. A 4-byte big-endian integer.
- The opaque data
- From the offset 62. Arbitrary string data.
- The back cyclic magic data.
- From the offset 127. A 1-byte integer.
The magic data and the package version data are used for identifying the kind of the file. The two figures of the cyclic magic data are used to check consistency of the meta data. The version data indicates the version of the Tkrzw package when the file is created.
The static flags specify properties which are not changed during the lifetime of the database. The first bit and the second bit represent the update mode: 0x1 means the in-place update mode and 0x2 means the appending update mode; 0x0 and 0x3 are invalid. The third bit and the fourth bit represent the record CRC mode: 0x0 means no CRC, 0x1 means CRC-8, 0x2 means CRC-16, and 0x3 means CRC-32. The fifth, sixth, and seventh bits represent the compression mode: 0x0 means no compression, 0x1 means ZLib, 0x2 means ZStd, 0x3 means LZ4, and 0x4 means LZMA.
The offset width specifies how many bytes are used to store an offset value. Given the offset width W, the maximum value is 2^(W*8). So, W=3 sets the maximum value 16,777,216 and W=4 sets it 4,294,967,296. The offset width affects the size of the buckets and the footprint of each record. The alignment power specifies the alignment of the offset value. Given the alignment power P, the alignment is 2^P. So, P=2 sets the alignment 4 and P=3 sets it 8. The alignment affects the size of space for each record. The maximum database size is determined as 2^(W*8+P). The default value of the offset width is 4 and the default value of the alignment power is 3. So, the maximum database size is 32GiB by default.
When a database is opened in the writable mode, the metadata section is updated immediately to set the closure flag to zero and set the last modified timestamp to the current UNIX time in microseconds. If the process crashes without closing the database, the flag and the timestamp help us detect the incident. If the writable database is closed normally, the closure flag is set to one and the last modified timestamp is updated as well.
The number of buckets determines how many buckets are used for the hash table. The number of records indicates the current number of living key-value pairs in the database. The effective data size indicates the total amount of bytes used for the key and value data of the living records. In other words, it doesn’t include removed data, overwritten data, or other footprints. This figure is used to calculate storage efficiency.
The database type and the opaque data are used for any purposes by the application. Modifying them is done in place so it doesn’t increase the file size even in the appending mode. The file tree database uses the opaque data to store its metadata.
The bucket section dominates from the offset 128 to an offset determined by the number of buckets and the offset width. Given the number of buckets M and the offset width W, the end offset is 128 + M * W. The default value of the number of buckets is 1,048,583. Then, the bucket section dominates from the offset 128 to 4,194,460. Each bucket is an integer in big-endian. If there’s no record matching the bucket index, the value is zero. Otherwise, it indicates the offset of the data of the first record in a linked list of records matching the bucket index.
The hash function is the 64-bit version of MurmurHash 3. It is applied to the whole key data. If the number of buckets is less than 2^32, the hash value is folded by taking XOR of the greater 32 bits and the lower 32 bits in the format of (([32,47]<<16)|[48,63]) ^ (([0,15]<<16)|[16,31]). The bucket index of the key is the remainder of division of the hash value by the number of buckets.
The record section dominates from an aligned offset after the bucket section to the end of the file. The start point is equal to or after the figure calculated as 128 + num_buckets * W + 1024 and it is aligned to 4096 and 2^P. The record section contains multiple record data in sequence. Each record is serialized in the following format.
- The magic data
- 1 byte integer.
- The child offset
- A big-endian integer with the offset width.
- The key size
- A byte delta encoded integer.
- The value size
- A byte delta encoded integer.
- The padding size
- A byte delta encoded integer.
- The CRC value
- Optional: A 1-byte, 2-byte, or 4-byte integer.
- The key data
- Optional: Arbitrary string data.
- The value data
- Optional: Arbitrary string data.
- The padding data
- Optional: A magic data and null data.
The magic data is composed of the operation type part and the checksum part. The upper 2 bits are the operation type part. 0xC0 (0x3 << 6) indicates that the operation does nothing. 0x80 (0x2 << 6) sets the value of an existing record. 0x40 (0x1 << 6) removes an existing record. 0x00 (0x0 << 6) adds a new record. The lower 6 bits are the checksum part. The checksum is the sum of the seed value 11 and each byte of the concatenated sequence of the record key and the record value, modulo 61. Then 3 is added to the final value so that the range is between 3 and 63 inclusive. The checksum helps us detect data corruption.
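Interpreted literally, the checksum computation described above can be sketched as follows (a reading of the description, not the library’s actual code):

#include <cstdint>
#include <string_view>

// Seed 11, add every byte of the key and the value modulo 61, then add 3 so
// that the result is between 3 and 63 inclusive.
uint8_t RecordChecksum(std::string_view key, std::string_view value) {
  uint32_t sum = 11;
  for (unsigned char c : key) sum = (sum + c) % 61;
  for (unsigned char c : value) sum = (sum + c) % 61;
  return static_cast<uint8_t>(sum + 3);
}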
The child offset indicates the offset of the next record data in the linked list of the bucket. When a new record data of the same bucket is added, the bucket points to the new record data and the new record data points to the top one of the existing linked list, or zero if it doesn’t exist.
The key size, the value size, and the padding size are represented in byte delta encoding. A value between 0 and 127 takes 1 byte. A value between 128 and 16,383 takes 2 bytes. A value between 16,384 and 2,097,151 takes 3 bytes. A value between 2,097,152 and 268,435,455 takes 4 bytes, and a value between 268,435,456 and 34,359,738,367 takes 5 bytes.
The CRC value is optional. If it exists, the value is the result of CRC-8, CRC-16, or CRC-32 applied to the concatenated data of the key and the value. The CRC value is used to detect inconsistency when the database is restored.
The key data and the value data are optional, which means an empty key and an empty value respectively. The key data is stored as-is. The value data is also stored as-is by default. If the record compression is enabled, compressed data by the algorithm is stored.
The padding data is optional. If it exists, it begins with a byte of 0xDD. The remaining data is null codes. The actual padding size includes additional size of the metadata for the padding size. That is, if the padding size in the record metadata is more than 127, the actual padding size is less by one byte. Likewise, larger than 16383 means less by two bytes, larger than 2097151 means less by three bytes.
With the default 4-byte offset width and small-sized (less than 128 bytes) keys and medium-sized (less than 16384 bytes) values, the footprint for each record is 1 + 4 + 1 + 2 + 1 = 9 bytes. However, due to alignment, padding bytes can be added. With the default 8-byte alignment, average padding size is about 4 bytes.
The 16 bytes before the record section are the record header section. It is composed of “TkrzwREC\n”, a 1-byte integer for the static flags, a 1-byte integer for the offset width, and a 1-byte integer for the alignment power. They would be used to recover records if the main header were broken.
The 1008 bytes before the record header section is the free block pool section. It contains pairs of an offset and a size of each free block. The offset is a big-endian integer of the offset width. The size is a big-endian integer of 4 bytes. The maximum number of pairs is determined as 1008 / (W + 4). Given the offset width 4, the maximum number is 126.
As long as the meta data for the file size indicates the actual content size, padding data of null codes can be appended at the end of the file. This is useful to align the file size to the requirement of direct I/O.
TreeDBM: The File Tree Database
The file tree database stores key-value structure in a single file. It uses a multiway balanced tree structure called B+ tree. Therefore, given the number of records N, the average time complexity of data retrieval is O(log N). Because records are ordered by the key, range searches including forward matching search are supported.
A B+ tree is composed of nodes, each of which is composed of records and links. Each node is serialized as a “page”, which is stored as a record of the file hash database. Therefore, the file tree database inherits many characteristics of the file hash database: performance, scalability, and durability. The file tree database uses a double-layered LRU cache system to reduce the frequency of input and output of pages; pages which have been repeatedly accessed are stored in the “hot” cache so that they are not wiped out even if many pages are accessed by a few traversal operations, which use the “warm” cache.
Thread concurrency is pursued in this implementation. Only reader-writer locking is applied to each page. Therefore, if threads randomly access records, they don’t block each other. If multiple threads access the same page, a writer blocks the others but readers don’t block other readers. The page cache system is managed with slotted mutexes so that multiple threads can access the cache system simultaneously.
If records are accessed sequentially in order of the key, performance of the file tree database can be better than the file hash database because file I/O is done by the page and its frequency is much lower. However, if records are accessed randomly, performance of the file tree database is worse than that of the file hash database. Likewise, space efficiency of the file tree database can be better than the file hash database if records are inserted in order. However, if records are inserted randomly, one third of the storage space is used for padding. As with the file hash database, rebuilding the database can reduce the file size, and it is done without blocking other threads.
See the tips of Tuning TreeDBM too. Tuning has a significant influence on performance. For details of the API, see the API document.
Example Code
This is a code example where basic operations are done without checking errors.
std::unique_ptr<DBM::Iterator> iter = dbm.MakeIterator();
iter->Jump("ba");
std::string key, value;
while (iter->Get(&key, &value) == Status::SUCCESS) {
  if (!StrBeginsWith(key, "ba")) break;
  std::cout << key << ":" << value << std::endl;
  iter->Next();
}
// Closes the database.
dbm.Close();
return 0;
}
This is a code example which represents a more serious use case with performance tuning and thorough error checks.
std::unique_ptr<DBM::Iterator> iter = dbm.MakeIterator();
if (iter->First() != Status::SUCCESS) {
  // Failure of the First operation is critical so we stop.
  Die("First failed: ", status);
}
while (true) {
  // Retrieves the current record data.
  std::string iter_key, iter_value;
  status = iter->Get(&iter_key, &iter_value);
  if (status == Status::SUCCESS) {
    std::cout << iter_key << ":" << iter_value << std::endl;
  } else {
    // This could happen if another thread removed the current record.
    if (status != Status::NOT_FOUND_ERROR) {
      // Error types other than NOT_FOUND_ERROR are critical.
      Die("Iterator::Get failed: ", status);
    }
    // NOT_FOUND_ERROR also means that there are no more records.
    std::cerr << "no more records to read" << std::endl;
    break;
  }
  // Moves the iterator to the next record.
  iter->Next();
}
This is an example to make a dictionary from a TSV file. The keys are normalized into lower case and forward matching queries are supported.
    const std::vector<std::string> columns = StrSplit(line, "\t");
    if (columns.size() < 2) continue;
    // Registers the record, normalizing the key into lower case.
    dbm.Set(StrLowerCase(columns[0]), columns[1]);
  }
  // Searches for records whose keys begin with "app" by forward matching.
  std::unique_ptr<DBM::Iterator> iter = dbm.MakeIterator();
  iter->Jump("app");
  std::string key, value;
  while (iter->Get(&key, &value) == Status::SUCCESS) {
    if (!StrBeginsWith(key, "app")) break;
    std::cout << key << ":" << value << std::endl;
    iter->Next();
  }
  // Closes the input file.
  file.Close();
  // Closes the database.
  dbm.Close();
  return 0;
}
Format
The tree database file is based on the hash database. Metadata of the B+ tree are stored in the opaque metadata section, and nodes of the B+ tree are stored as records of the hash database.
The opaque metadata section of the hash database is a space to store an arbitrary 64-byte string, which the tree database uses to store its metadata. It contains the following fields.
- Magic data
- From the offset 0. A string “TDB” and a trailing null character.