FoundationDB and SQLite
FoundationDB’s use of SQLite’s btree as its own storage engine should not be used to support claims that SQLite has a high quality and performant storage implementation.
Why SQLite
Designing and writing a storage engine is a substantial amount of effort. Previous works have shown it to be around 2-4 person-years of effort to get to a first release. Distributed database projects often first use an existing embedded database as their storage engine. This approach saves some person-years of effort, which can be better applied elsewhere, and works well enough until the project is sufficiently mature to be able to justify and staff developing its own. This pattern has been followed by other projects as well.
FoundationDB began in 2009. Let’s review the landscape of embedded databases around this time:
-
BerkeleyDB: 1994
-
Tokyo Cabinet: 2006
-
Embedded InnoDB[1]: 2009 [1]: There’s been multiple attempts at "embedded innodb" over time, and none of them have ever stuck. Innobase Oy released an official one in 2009. There was another attempt from the MySQL fork Drizzle, which relased HailDB. It was then again resurrected as Embedded InnoDB. A GPL license was attached in all cases, as it was derived from MySQL, which I suspect contributes to its lack of adoption.
-
LevelDB: 2011
-
LMDB: 2011
-
RocksDB: 2012
-
WiredTiger: 2012
The Bad
No Checksums
SQLite assumes that the detection and/or correction of bit errors caused by cosmic rays, thermal noise, quantum fluctuations, device driver bugs, or other mechanisms, is the responsibility of the underlying hardware and operating system. SQLite does not add any redundancy to the database file for the purpose of detecting corruption or I/O errors. SQLite assumes that the data it reads is exactly the same data that it previously wrote.
This is an unusual stance for a database to take, and one that I personally consider to be irresponsible. Underlying hardware is widely known to not be perfectly reliable, especially on cheap consumer devices where SQLite is frequently found. Operating Systems do not generally make any effort at detecting or fixing such corruption. Databases generally assume and acknowledge this risk.
Let’s compare this stance to the same list of embedded databases above:
| Database | Checksums Supported? | Enabled By Default? | Notes |
|---|---|---|---|
BerkeleyDB |
Yes |
Yes |
Controllable using set_flags via DB_CHKSUM |
Tokyo Cabinet |
No |
N/A |
|
InnoDB |
Yes |
Yes |
controllable via innodb_checksum_algorithm |
And Oracle RDBMS, Microsoft SQL Server, MySQL, and Postgres all implement page-level checksums. It is the common, expected default that a database should not provide corrupted results to an application or user, but SQLite disavows responsibility of this. Note that this transitively casts doubt on all "replicated SQLite" projects. Rqlite, dqlite, Litestream, etc., all implicitly inherit the ignorace of disk corruption and the risks that pose. The replication in those situations does nothing to resolve or mitigate the corruption risk.
FoundationDB modified the SQLite B-Tree to add page-level checksums, via the sqlite codec interface. A FoundationDB process that encounters a database corruption will kill itself and remove its local data, and re-replicates a correct copy of the data from another replica.
Sqlite, as of 2020, offers the Checksum VFS Shim, which is a plugin that adds checksum verification by adding an intermediate layer between the filesystem and its btree. However, this support is not enabled in the default build, it is not enabled by distros (or at least in the Fedora 36 I’m currently running), nor can I find a separate RPM packaging this plugin for use. So, though it’s positive that this now exists, the barriers of access to it means that nearly all sqlite users are still exposed to potential undetected data corruptions.
Poor Branching Factor
In Defense of SQLite
FoundationDB (intentionally) mis-used the SQLite B-Tree. SQLite actually implements two different B-Trees. Its first B-tree only supports an 8-byte integer as the key, and is a B+-Tree, with all of the values only existing at the left level. This is used as the primary index, where the row ID is the key, and the value is the row itself. The second btree supports arbitrary key sizes, and holds the value inline with the key in all intermediate nodes. It is used as the btree for secondary indexes in SQLite, where the value is expected to be a row id, and thus be small and cheap to hold inline.
As the latter was the only implementation that supported arbitrary length keys and values, FoundationDB stores all of its data in the "secondary index" B-Tree. This means that the "poor branching factor" argument was largely self-inflicted. "Arbitrary length" values are typically significantly larger than the 8-byte values that SQLite optimized for.
But lacking checksums is still reprehensible.
Future
There are two parallel efforts ongoing (as of 2022) to finally develop a better replacement for SQLite’s btree.
B-Tree: Redwood
LSM: RocksDB
Conclusions
Using SQLite’s btree as storage for SQLite is okay. It’s not amazingly efficient along any dimension, but it does a reasonable job, and makes reasonable design tradeoffs. Using SQLite’s secondary indexing btree as a general key-value storage engine is largely not recommended.
See discussion of this page on Reddit, Hacker News, and Lobsters.