and [CoW](https://en.wikipedia.org/wiki/Copy-on-write).
- Multiple key-value sub-databases within a single datafile.
- Range lookups, including range query estimation.
- Efficient support for short fixed length keys, including native 32/64-bit integers.
- Ultra-efficient support for [multimaps](https://en.wikipedia.org/wiki/Multimap). Multi-values sorted, searchable and iterable. Keys stored without duplication.
- Data is [memory-mapped](https://en.wikipedia.org/wiki/Memory-mapped_file) and accessible directly/zero-copy. Traversal of database records is extremely-fast.
- Transactions for readers and writers; readers do not block writers and vice versa (a minimal usage sketch follows this list).
- Writes are strongly serialized. No transaction conflicts nor deadlocks.
- Readers are [non-blocking](https://en.wikipedia.org/wiki/Non-blocking_algorithm), notwithstanding [snapshot isolation](https://en.wikipedia.org/wiki/Snapshot_isolation).
- `O(log N)` cost of lookup, insert, update, and delete operations by virtue of [B+ tree characteristics](https://en.wikipedia.org/wiki/B%2B_tree#Characteristics).
- Online hot backup.
- Append operation for efficient bulk insertion of pre-sorted data.
- No [WAL](https://en.wikipedia.org/wiki/Write-ahead_logging) nor any
transaction journal. No crash recovery needed. No maintenance is required.
- No internal cache and/or memory management; all of this is done by basic OS services.
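To make these features concrete, below is a minimal usage sketch (not canonical documentation) of the C API: opening an environment, writing within a serialized write transaction, and reading back zero-copy in a read transaction. Exact flag spellings (e.g. `MDBX_DB_DEFAULTS`, `MDBX_UPSERT`) may differ slightly between libmdbx versions; error handling is reduced to asserts for brevity.

```c
#include "mdbx.h"
#include <assert.h>
#include <string.h>

int main(void) {
  MDBX_env *env = NULL;
  int rc = mdbx_env_create(&env);
  assert(rc == MDBX_SUCCESS);

  /* Open (or create) a single-file database; 0664 is the POSIX mode for new files. */
  rc = mdbx_env_open(env, "./example.mdbx", MDBX_NOSUBDIR, 0664);
  assert(rc == MDBX_SUCCESS);

  /* Write transaction: writers are serialized, readers are never blocked. */
  MDBX_txn *txn = NULL;
  rc = mdbx_txn_begin(env, NULL, MDBX_TXN_READWRITE, &txn);
  assert(rc == MDBX_SUCCESS);

  MDBX_dbi dbi;
  rc = mdbx_dbi_open(txn, NULL, MDBX_DB_DEFAULTS, &dbi); /* the main (unnamed) sub-database */
  assert(rc == MDBX_SUCCESS);

  char kbuf[] = "greeting", vbuf[] = "hello, libmdbx";
  MDBX_val key = {kbuf, sizeof(kbuf) - 1};
  MDBX_val val = {vbuf, sizeof(vbuf) - 1};
  rc = mdbx_put(txn, dbi, &key, &val, MDBX_UPSERT);
  assert(rc == MDBX_SUCCESS);
  rc = mdbx_txn_commit(txn);
  assert(rc == MDBX_SUCCESS);

  /* Read transaction: the value is returned zero-copy from the memory map. */
  rc = mdbx_txn_begin(env, NULL, MDBX_TXN_RDONLY, &txn);
  assert(rc == MDBX_SUCCESS);
  MDBX_val found;
  rc = mdbx_get(txn, dbi, &key, &found);
  assert(rc == MDBX_SUCCESS && found.iov_len == sizeof(vbuf) - 1);
  mdbx_txn_abort(txn); /* end the read snapshot */

  mdbx_env_close(env);
  return 0;
}
```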
## Limitations
- **Page size**: a power of 2, maximum `65536` bytes, default `4096` bytes.
- **Key size**: minimum 0, maximum ≈¼ pagesize (`1300` bytes for default 4K pagesize, `21780` bytes for 64K pagesize). The effective limits can also be queried at run time; see the sketch after this list.
- **Value size**: minimum 0, maximum `2146435072` (`0x7FF00000`) bytes for maps, ≈¼ pagesize for multimaps (`1348` bytes for default 4K pagesize, `21828` bytes for 64K pagesize).
- **Write transaction size**: up to `4194301` (`0x3FFFFD`) pages (16 [GiB](https://en.wikipedia.org/wiki/Gibibyte) for default 4K pagesize, 256 [GiB](https://en.wikipedia.org/wiki/Gibibyte) for 64K pagesize).
- **Database size**: up to `2147483648` pages (8 [TiB](https://en.wikipedia.org/wiki/Tebibyte) for default 4K pagesize, 128 [TiB](https://en.wikipedia.org/wiki/Tebibyte) for 64K pagesize).
- **Maximum sub-databases**: `32765`.
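These limits depend on the page size chosen when the database is created. As an illustration only — the calls named here, `mdbx_env_stat()` and `mdbx_env_get_maxkeysize()`, are assumed from the C API — the effective values can be queried from an opened environment:

```c
#include "mdbx.h"
#include <stdio.h>

/* Print the effective page size and key-size limit of an already-opened
 * environment (see the usage sketch above for opening `env`). */
static void print_limits(MDBX_env *env) {
  MDBX_stat st;
  if (mdbx_env_stat(env, &st, sizeof(st)) == MDBX_SUCCESS)
    printf("page size: %u bytes\n", st.ms_psize);
  printf("max key size: %d bytes\n", mdbx_env_get_maxkeysize(env));
}
```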
## Gotchas
1. There cannot be more than one writer at a time, i.e. no more than one write transaction at a time.
2. _libmdbx_ is based on a [B+ tree](https://en.wikipedia.org/wiki/B%2B_tree), so access to database pages is mostly random.
Thus SSDs provide a significant performance boost over spinning disks for large databases.
3. _libmdbx_ uses [shadow paging](https://en.wikipedia.org/wiki/Shadow_paging) instead of [WAL](https://en.wikipedia.org/wiki/Write-ahead_logging). Thus syncing data to disk might be a bottleneck for write-intensive workloads.
4. _libmdbx_ uses [copy-on-write](https://en.wikipedia.org/wiki/Copy-on-write) for [snapshot isolation](https://en.wikipedia.org/wiki/Snapshot_isolation) during updates, but a read transaction prevents recycling of old retired/freed pages, since it may still read them. Thus altering data during a parallel
long-lived read operation will increase the process working set, may exhaust the entire free database space,
cause the database to grow quickly, and result in performance degradation.
Try to avoid long-running read transactions (see the read-transaction sketch after this list).
5. _libmdbx_ is extraordinarily fast and provides minimal overhead for data access,
so you should reconsider using brute-force techniques and double-check your code.
On the one hand, with _libmdbx_ a simple linear search may be more efficient than complex indexes.
On the other hand, if you do something suboptimally, you may notice the detrimental effect only on sufficiently large data.
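As a hedged illustration of gotchas 3 and 4, the sketch below keeps read transactions short so that retired pages can be recycled promptly; the function and result names are from the C API, everything else is hypothetical application code.

```c
#include "mdbx.h"
#include <string.h>

/* Look up `key` and copy the value into a caller-provided buffer, ending the
 * read transaction as soon as possible so GC can recycle retired pages. */
static int lookup_copy(MDBX_env *env, MDBX_dbi dbi, const MDBX_val *key,
                       void *buf, size_t bufsize, size_t *len) {
  MDBX_txn *txn = NULL;
  int rc = mdbx_txn_begin(env, NULL, MDBX_TXN_RDONLY, &txn);
  if (rc != MDBX_SUCCESS)
    return rc;
  MDBX_val value;
  rc = mdbx_get(txn, dbi, key, &value);
  if (rc == MDBX_SUCCESS) {
    /* The value points into the memory map: copy it out while the snapshot
     * is still alive, since it becomes invalid once the transaction ends. */
    *len = value.iov_len;
    if (value.iov_len <= bufsize)
      memcpy(buf, value.iov_base, value.iov_len);
    else
      rc = MDBX_RESULT_TRUE; /* hypothetical "buffer too small" signal */
  }
  mdbx_txn_abort(txn); /* release the snapshot promptly */
  return rc;
}
```

For a reader that genuinely must live longer, the `mdbx_txn_reset()` / `mdbx_txn_renew()` pair can park and revive the same read-only transaction between lookups instead of holding one snapshot open.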
## Comparison with other databases
For now please refer to [chapter of "BoltDB comparison with other
> During each commit _libmdbx_ merges suitable freed pages into the unallocated area
> at the end of the file, and then truncates the unused space once a sufficient amount has accumulated.
5. The same database format for 32- and 64-bit builds.
> _libmdbx_ database format depends only on the [endianness](https://en.wikipedia.org/wiki/Endianness) but not on the [bitness](https://en.wiktionary.org/wiki/bitness).
6. LIFO policy for Garbage Collection recycling. This can significantly increase write performance, up to several times in a best-case scenario, due to the effect of a write-back disk cache.
> LIFO means that the pages which became unused most recently will be taken for reuse first.
> Therefore the loop of database page circulation becomes as short as possible.
> In other words, the set of pages that are (over)written in memory and on disk during a series of write transactions will be as small as possible.
> This creates ideal conditions for the efficiency of a battery-backed or flash-backed disk cache.
7. Fast estimation of range query result volume, i.e. how many items can
be found between a `KEY1` and a `KEY2`. This is a prerequisite for building
and/or optimizing query execution plans (a usage sketch follows this list).
> _libmdbx_ performs a rough estimate based on common B-tree pages of the paths from root to corresponding keys.
8. `mdbx_chk` utility for database integrity check.
Since version 0.9.1, the utility supports checking the database against any of the three meta pages, with the ability to switch between them.
9. Automated steady sync-to-disk upon several thresholds and/or timeout via cheap polling.
10. Sequence generation and three persistent 64-bit markers.
11. Handle-Slow-Readers callback to resolve database full/overflow issues due to long-lived read transaction(s).
12. Support for opening databases in the exclusive mode, including on a network share.
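As mentioned in item 7 above, a usage sketch of range estimation follows; it assumes the `mdbx_estimate_range()` entry point of the C API and an open (read) transaction `txn` with a `dbi` handle.

```c
#include "mdbx.h"
#include <stddef.h>
#include <stdio.h>

/* Roughly estimate how many items lie between key1 and key2; the per-item
 * data bounds are left NULL, so only the keys constrain the range. */
static void estimate_between(MDBX_txn *txn, MDBX_dbi dbi,
                             MDBX_val *key1, MDBX_val *key2) {
  ptrdiff_t items = 0;
  int rc = mdbx_estimate_range(txn, dbi, key1, NULL, key2, NULL, &items);
  if (rc == MDBX_SUCCESS)
    printf("estimated items in range: %td\n", items);
}
```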
## Added Abilities
1. Zero-length for keys and values.
2. Ability to determine whether the particular data is on a dirty page
or not, which allows avoiding copy-out before updates.
3. Ability to determine whether the cursor points to a key-value
pair, to the first, to the last, or is not set to anything.
4. Extended information about the whole database, sub-databases, transactions, and enumeration of readers.
> _libmdbx_ provides a lot of information, including dirty and leftover pages
> for a write transaction, reading lag and holdover space for read transactions.
5. Extended update and delete operations (a sketch follows this list).
> _libmdbx_ allows performing an update or delete _at once_ with getting the previous value,
> as well as addressing a particular item among the multi-values of the same key.
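A sketch of the extended update from item 5 (see the forward reference there), assuming the `mdbx_replace()` call of the C API; note that `old_val` must follow the API's buffer contract for receiving the previous value.

```c
#include "mdbx.h"

/* Overwrite `key` with `new_val` and receive the previous value in a single
 * operation, inside an already started write transaction `txn`. */
static int upsert_get_old(MDBX_txn *txn, MDBX_dbi dbi, MDBX_val *key,
                          MDBX_val *new_val, MDBX_val *old_val) {
  return mdbx_replace(txn, dbi, key, new_val, old_val, MDBX_UPSERT);
}
```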
## Other fixes and specifics
1. Fixed more than 10 significant errors, in particular: page leaks,
wrong sub-database statistics, segfault in several conditions,
nonoptimal page merge strategy, updating an existing record with
a change in data size (including for multimap), etc.
2. All cursors can be reused and should be closed explicitly,
regardless of whether they were opened within a write or read transaction.
3. The opening of database handles is spared from race conditions and
pre-opening is not needed.
4. Returning `MDBX_EMULTIVAL` error in case of ambiguous update or delete.
5. Guarantee of database integrity even in asynchronous unordered write-to-disk mode.
> _libmdbx_ proposes an additional trade-off via `MDBX_SAFE_NOSYNC` with an append-like manner of updates,
> which avoids database corruption after a system crash, in contrast to LMDB.
> Nevertheless, the `MDBX_UTTERLY_NOSYNC` mode is available to match the behaviour of `MDB_NOSYNC` in LMDB.
6. On **MacOS & iOS** the `fcntl(F_FULLFSYNC)` syscall is used _by
default_ to synchronize data with the disk, as this is [the only way to
- The linear scale on the left and the dark rectangles show the arithmetic mean in
thousands of transactions per second;
- The logarithmic scale on the right is in seconds, and the yellow intervals show the
execution time of transactions. Each interval shows the minimal and maximum
execution time; a cross marks the standard deviation.
**1,000,000 transactions in async-write mode**. In case of a crash all data is consistent and conforms to one of the last successful transactions, but the number of lost transactions is much higher than in
lazy-write mode. All DB engines in this mode perform as few writes to
persistent storage as possible. _libmdbx_ uses
[msync(MS_ASYNC)](https://linux.die.net/man/2/msync) in this mode.
In the benchmark each transaction contains combined CRUD operations (2
inserts, 1 read, 1 update, 1 delete). The benchmark starts on an empty database,
and after the full run the database contains 10,000 small key-value records.
#### This is a mirror of the origin repository that was moved to [abf.io](https://abf.io/erthink/) because of discriminatory restrictions for Russian Crimea.