How - SQLite
Lets start with some basics. A basic unit of storage for a database is a page. Pages are generally 4k or 8k. An SQLite file is a sequence of pages one after another.
SQLite also supports WAL mode. This means during writes, pages are first placed in a write-ahead-log and later checkpointed (copied) to the main file. When SQLite is trying to read a page, it first checks an in-memory hash table if that page has to be read from WAL, otherwise it reads from the main database file.
If you want to run SQLite on top of something else, the WAL module is an ideal place to do it. It has a very clean interface to the rest of SQLite: init, find page, read page, write page, checkpoint.
It is also the perfect place to implement compression and encryption.
Why
The best way we can describe ActorDB is that it is an ideal server-side database for apps. Think of running a large mail service, dropbox, evernote, etc. They all require server side storage for user data, but the vast majority of queries to the database is within a specific user.
SQLite is great for storing per-user data. It supports common table expressions, foreign keys, even json columns. A single instance is completely contained and many can be run at the same time.
In previous versions of ActorDB, we hacked SQLite to combine all WAL files into one. This made writing to many SQLite instances at the same time reasonably efficient. But we still had the base file separate for every user. This is less than ideal.
Then we found LMDB. It is efficient, compact and rich with features. It is the only viable storage engine we can use out of the box.
LMDB writes will keep data exactly where it was placed the first time it was written. This is important because our writes are up to 4k, which is quite large. Up to because every page is compressed using lz4.
LMDB has transactions. We can atomically store multiple k/v tuples. Very important because every write, usually changes more than one page and we need to store some meta data as well.
LMDB can have multiple databases contained in the file.
LMDB can store multiple sorted values for a given key (called a dupsort). This is a very important feature we make heavy use of.
How - LMDB
Dupsorts are important because of replication. ActorDB uses raft distributed consensus protocol. You can add new nodes at any time and it will rebalance itself without down time.
Raft is a protocol that is built around a write log. All our data is within LMDB. Thus we must maintain a write log inside LMDB, without requiring checkpoints that copy data some place else. Once a page is written it should stay where it is until it is no longer required.
Our LMDB schema is made out of 4 databases:
- Actors database: {ActorName, ActorId}
ActorId is a 64bit integer.
- Pages database: {PageKey, [Page1,Page2,..]}
PageKey = <<ActorId:64, PageNumber:32>>
Page = <<EventTerm:64, EventNum:64, Frag:8, CompressedPage/binary>>
This is a dupsort. For every SQLite page index, we store a list of actual pages. From oldest to latest.
- Log database: {LogKey,[PageNumber1, PageNumber2,..]}
LogKey = <<ActorId:64, EventTerm:64, EventNumber:64>>
This is a dupsort. Value is a sorted list of page numbers. Key is made out of actor id, raft event term and raft event number.
- Info database: {ActorId, InfoVal}
InfoVal = <<FirstTerm:64, FirstEvnum:64, LastTerm:64, LastEvnum:64, InProgTerm:64, InProgEvnum:64, (and some more raft data..)>>
Info database stores our log limits. First write is at N, last write is N+M.
Every write is a transaction to pages, log and info databases. Pages stores the actual data, log stores the index so we know which pages were involved in the write, info increments LastTerm/LastEvnum.
Every read is a simple lookup inside the pages database. Given an ActorId and a desired page number, return the last value in the dupsort for that key.
Replication meta data is stored in the log and info databases.
We still need to do checkpoints, but they do not require copying pages around. All we need to do is delete the oldest data from log and pages databases, then update the info database.
This LMDB schema gives us lots of flexibility. For instance, if we want to move actor A1 from node N1 to node N2.
1. Lock A1 for checkpoints.
2. Create an iterator for last event number (LEV).
3. For every key in pages database for A1, read the last page in dupsort that is <= LEV.
4. Once we have copied the last page, lock A1 for writes.
5. Repeat the copy process for every write that happened during the copy process.
6. Once all pages are copied, delete local data and redirect writes to node N2.
The above is required in multiple cases:
- Cluster rebalancing.
- We want to create an actor as a copy of another (without step 6).
- Very stale node has turned on (without step 6).
Right now we are quite satisfied with how our storage engine is performing. Using LMDB has resulted in a large performance boost and a cleaner code base.
Try it out!
http://www.actordb.com/
https://github.com/biokoda/actordb
Written by Sergej Jurecko