LPC-2019

SQLite on Linux
Login

This is a briefing on SQLite intended for Linux kernel hackers, and especially those working on Linux filesystems.

1.0 SQLite Is Not Like Other Database Engines

2.0 Filesystem Properties That SQLite Would Like To Know About

SQLite works unsupervised, in whatever environment it is handed. SQLite does not get to choose a particular filesystem type or specific mount options. There is no configuration file available to tell SQLite about the capabilities of system on which it is running. SQLite needs to work well on a USB memory stick with a DOS filesystem up to an enterprise-class SSD with battery back-up and the latest filesystem, and everything in between.

To help it work most efficiently, SQLite needs to know attributes of the environment in which it is running. And because it lacks a configuration file, SQLite needs to discover these attributes on its own, at run-time. Some attributes of the system that SQLite would like to know about include:

3.0 Transactions

Like all relational databases, SQLite needs to implement transactions that are ACID (Atomic, Consistent, Isolated, and Durable) even if the application crashes (for example by SIGKILL) or if there is an unexpected power loss. This is achieved in one of three ways:

  1. A rollback journal
  2. A write-ahead log
  3. Using atomic-write capabilities of the underlying filesystem.

The rollback journal approach is the slowest, but also the most likely to work on any filesystem. Rollback is therefore the default. The write-ahead log (WAL mode) is faster, but only works on systems for which either (1) only a single process accesses the database at a time or (2) a mmap()-ed file can be used as shared memory.

The third option is the fastest but is currently only supported for the F2FS filesystem on Linux.

Each database file is either in rollback-mode or in WAL-mode. All processes access the database file according to its defined mode. Changing the mode of a database requires exclusive access to the database. The F2FS atomic write capability (option 3) is an extension of rollback mode (option 1).

3.1 Transactions using a rollback journal

An SQLite database is a sequence of one or more "pages". All pages in the same database file are the same size. But for different databases, the page size can be any power of two between 512 and 65536. A single change to the database typically involves modifications to multiple pages. This is done atomically as follows:

  1. Creating a rollback journal file in the same directory as the database and having the same name except with the "-journal" extension added.

  2. The original content of any database page that is to be modified is written into the rollback journal.

  3. Call fdatasync() on the rollback journal.

  4. Open the directory that contains the rollback journal and fdatasync() that directory, to ensure that the filename will exist following a power loss

  5. Acquire an exclusive lock on the database file

  6. Barrier - All of the above must complete before any of the following.

  7. Make changes to the database file.

  8. Call fdatasync() on the database file. If the database file decreases in size, also call ftruncate().

  9. Barrier - All of the above must complete before any of the following.

  10. Commit the transaction by doing one of:

    1. Unlink the rollback journal
    2. Truncate the rollback journal to zero bytes in size
    3. Overwrite the first 8 bytes of the rollback journal with zeros

  11. Drop the exclusive lock

Notes:

3.2 Recovery Of A Rollback Journal Following A Crash

  1. When initially opening the database file, check for the presence of a well-formed rollback journal. If not found → Done.

  2. Play back the rollback journal into the database file.

  3. Call fdatasync() on the database file.

  4. Barrier - All of the above must complete before any of the following.

  5. Disable the rollback journal using one of the following:

    1. Unlink the rollback journal
    2. Truncate the rollback journal to zero bytes in size
    3. Overwrite the first 8 bytes of the rollback journal with zeros

3.3 Transactions using write-ahead log

  1. Append modified pages to the write-ahead log file (the "WAL file"). The WAL file is a file in the same directory as the database and with the same name as the database but with "-wal" appended.

  2. Append a "commit mark"

  3. Call fdatasync() on the WAL file to commit the transaction

Notes:

3.4 Checkpointing A Write-Ahead Log

  1. Call fdatasync() on the "-wal" file.

  2. Wait for all concurrent readers to stop using pages in the main database that have changes in the WAL file.

  3. Barrier - All of the writes to the WAL file must complete before any subsequent writes to the database file.

  4. Write the most recent change for every page in the WAL file back into the database file.

    • Pages are sorted so that they are written in increasing order. → Is this helpful on Linux? Or can we just as well skip the sort and write the pages in any arbitrary order?
  5. Call fdatasync() on the database file.

  6. Barrier - All of the above must complete before any of the following.

  7. Truncate the WAL file, or otherwise cause the WAL file to start over again at the beginning.

    • We have seen that overwriting an existing file is faster than truncating and appending. Is that always the case?

Notes:

3.5 Recovery Of A Write-Ahead Log Following A Crash

  1. Rebuild the "-shm" file by scanning the "-wal" file.

3.6 Transactions using the atomic-write capabilities of F2FS

F2FS is a log-structured filesystem for Linux that has (limited) atomic write capabilities. SQLite is able to use the atomic write capabilities of F2FS to bypass the rollback journal. Anecdotal reports are that an Android phone that is reformatted to use F2FS is noticeably faster. F2FS make an old tired phone feel like a perky new phone.

The atomic write capability in F2FS is implemented by three ioctl()s:

The F2FS atomic write capability is only useful for databases that would otherwise be in rollback mode. The atomic write capability is not helpful for WAL-mode databases.

Transactions using F2FS atomic write proceed as follows:

  1. Start a new transaction by invoking F2FS-BEGIN. If that ioctl() fails, fallback to using a rollback journal.

  2. Write changes directly into the database file. If any write fails, SQLite will abort the transaction by invoking F2FS-ROLLBACK and restart the transaction using a rollback journal.

  3. If the application requests a transaction abort by issuing an SQL "ROLLBACK" statement, then SQLite aborts the atomic write by invoking the F2FS-ROLLBACK ioctl().

  4. Commit the transaction by invoking the F2FS-COMMIT ioctl(). If that ioctl() fails, SQLite will retry the transaction using a traditional rollback journal.

Notes:

4.0 Potential Enhancements

Performance of SQLite on Linux is already very fast. However, it might be improved further with new Kernel interfaces.

4.1 fbarrier()

SQLite uses fsync() or fdatasync() to ensure I/O operations are complete. However, if an application is willing to forego Durability (The "D" in "ACID") following a power loss then many of these fdatasync() calls could be replaced by a hypothetical fbarrier(). This would be useful in each spot marked above by "Barrier".

In order to be useful, the hypothetical fbarrier() operation must:

4.2 Designate unused portions of a file

When an application deletes content from an SQLite database, SQLite does not normally overwrite the deleted content, but merely remembers that the space is available for reuse. This avoids unnecessary writes. (Exception: Sometimes an application actually wants content to be overwritten to avoid forensic traces, and SQLite supports this upon request. For example, Firefox puts SQLite into the mode where it overwrites deleted content with zeros when you clear your search history.)

SQLite could tell the filesystem about regions of the file that are unused and do not need to be preserved.