Using SQLite as a Custom File Format

First published in Micro Mart #1426, August 2016

When we create an application we will almost certainly need a way to store custom data. For small amounts of data we can use .ini files or on Windows we can use the registry, but for larger amounts of data (e.g., the application's notion of a “document”), we will need to create a storage format. Such formats can be binary or plain text, custom or predefined.

For example, the SVG image format is plain text (XML markup; UTF-8 encoding) and predefined, whereas the FLAC audio format is binary and predefined. In this article we will discuss custom formats, and show how to use SQLite as a custom binary format.

For custom data, using a plain text format is very appealing. Such formats are human readable and so tend to be much easier to debug than binary formats. And if we use an existing text format such as JSON or XML, libraries exist in almost every language for reading and writing them. For small documents, such formats work well, but for large amounts of data they can be quite slow to read and write, especially XML.

Creating a custom binary format can be tricky, especially when debugging, but can produce very compact documents which are very fast to read and write.

The main problem with custom plain text and binary formats is that it can be quite painful to change them. Suppose, for example, that version 2 of our application needs to store additional data. We need to be able to read our version 1 files and fill in the gaps with sensible defaults so that they can be saved as version 2 files. Over the years I've tried many different solutions both plain text and binary, but nowadays I use a one-size-fits-all solution: every custom document type I create is a SQLite database.

SQLite's “Lite” may make it sound flaky but in fact it is the most widely deployed database in the world (inside practically every smartphone), and its test suite at well over three billion automated tests provides 100% code coverage — so it is extremely reliable. SQLite is an in-process database which means that it exists inside the same operating system process as the application that uses it, so it is normally used as a single-user database. Each SQLite database is stored in a single file, no matter how many tables, views, and triggers it contains, so it is very easy to use as an application's document storage format.

Let's imagine that we want to store simple card index cards, each with a title and a body:

class IndexCard:
    def __init__(self, title, body=None):
        self.title = title
        self.body = body

We could, of course, store them in memory, loading and saving them all in one go. But if we use SQLite, we only need to read in the cards we're interested in and can add, delete, or update them individually with very little memory overhead. If our data items were a lot bigger and more complex and used a lot more memory per-item, this could make a big difference to performance.

We'll hold our cards in a model that wraps a SQLite database and which uses the third-party APSW library rather than the standard library's rather weaker sqlite3 module.

VERSION = 1

class IndexCardModel:
    def __init__(self, filename):
        create = not os.path.exists(filename)
        self.db = apsw.Connection(filename)
        cursor = self.db.cursor()
        cursor.execute(Sql.Prepare)
        pragma = Sql.SetVersion.format(VERSION)
        with Transaction(self.db) as cursor:
            if create:
                cursor.execute(Sql.Create)
            else:
                version = getfirst(Sql.GetVersion)
                if version < VERSION:
                    pass # Update Database structure here
            cursor.execute(pragma)

A model must be created with a filename which may be an existing database of cards or a non-existent file that must be created. The Transaction class is from our Automated Cleanup with Python article. SQLite allows us to set and get a user_version from a database. We use it here to set our database version when creating a new database and to check if an existing database is up-to-date.

The Sql namespace is one we have created like this:

Sql = types.SimpleNamespace()
Sql.GetVersion = "PRAGMA user_version;"
Sql.SetVersion = "PRAGMA user_version = {};"
Sql.Prepare = """PRAGMA foreign_keys = 1;
                 PRAGMA temp_store = MEMORY;"""
Sql.Create = """CREATE TABLE cards (
                cid INTEGER PRIMARY KEY NOT NULL,
                title TEXT NOT NULL, body TEXT);"""
Sql.Add = """INSERT INTO cards (title, body)
             VALUES (:title, :body);
             SELECT LAST_INSERT_ROWID();"""
Sql.Insert = """INSERT INTO cards (cid, title, body)
                VALUES (:cid, :title, :body);"""
Sql.Update = """UPDATE cards SET title = :title, body = :body
                WHERE cid = :cid;"""
Sql.Delete = "DELETE FROM cards WHERE cid = :cid;"
Sql.Get = "SELECT title, body FROM cards WHERE cid = :cid;"
Sql.Select = """SELECT cid, title, body FROM cards
                ORDER BY LOWER(title);"""
Sql.Count = "SELECT COUNT(*) FROM cards;"
Sql.Exists = "SELECT COUNT(*) FROM cards WHERE cid = :cid;"

In this particular case we could have used a collections.namedtuple since the values aren't changed, but using a types.SimpleNamespace makes for an easier syntax in this case.

def getfirst(cursor, sql, params=None, Class=int):
    params = {} if params is None else params
    first = cursor.execute(sql, params).fetchone()[0]
    return bool(int(first)) if isinstance(Class, bool) else Class(first)

This is a helper function for extracting the first field from the first record produced by a query and returning it with the correct type. We have to special case Booleans because bool("0") is True (because it is a non-empty string), but bool(int("0")) is False since bool(0) is False.

    def close(self):
        self.db.close()
        self.db = None
    def __enter__(self):
        return self
    def __exit__(self, exc_type, exc_val, exc_tb):
        self.db.close() 

These methods allow a model instance to be used in a with statement, or in a try ... finally block with an explicit call to close() to guarantee finalization.

    def add(self, card): # cid = model.add(card)
        with Transaction(self.db) as cursor:
            return getfirst(cursor, Sql.Add, vars(card))

This method adds a new card and returns its cid (Card ID). Python's built-in vars() function returns a key-value dictionary of the object it is given, i.e., a copy of the object's __dict__, if it has one. So for IndexCards it returns a {"title": "card title", "body": "card body"} dict. APSW's cursor.execute() method takes a SQL query string and optionally a dict where the values are substituted for the corresponding keys in the query. The Sql.Add query has two separate statements, the second returning the newly inserted card's cid.

    def __getitem__(self, cid): # card = model[cid]
        with Transaction(self.db) as cursor:
            row = cursor.execute(Sql.Get, dict(cid=cid)).fetchone()
        return IndexCard(*row) if row is not None else None

This special method allows us to retrieve a card by its cid. We've chosen to fail safe by returning None for an invalid cid, but perhaps a better alternative would be to raise a custom InvalidCidError exception.

    def __delitem__(self, cid): # del model[cid]
        with Transaction(self.db) as cursor:
            cursor.execute(Sql.Delete, dict(cid=cid))

This method allows a card identified by its cid to be deleted. Strictly speaking we don't need a transaction for this method or the previous one since they are both single statements, but using a transaction is safer since if the SQL was changed to be two or more statements they would still work correctly.

    def __setitem__(self, cid, card): # model[cid] = card
        d = vars(card)
        d["cid"] = cid
        with Transaction(self.db) as cursor:
            exists = getfirst(cursor, Sql.Exists, d, bool)
            sql = Sql.Update if exists else Sql.Insert
            cursor.execute(sql, d)

The transaction is needed here because we do two separate statements both of which must succeed or fail. The first statment is used to tell us if we're updating an existing card or replacing (really, inserting) with a new card, and the second to perform the update or insert.

    def __iter__(self): # for cid, card in model:
        with Transaction(self.db) as cursor:
            for row in cursor.execute(Sql.Select):
                cid = row[0]
                card = IndexCard(row[1], row[2])
                yield cid, card

Users of our model can iterate over all the cards using this method. If we wanted to allow a constraint (e.g., all cards with an empty body or all cards with a title containing a specific word), we would have to provide a separate method, perhaps with the signature iterate(self, where="", orderby="ORDER BY LOWER(title)").

    def __len__(self): # size = len(model)
        with Transaction(self.db) as cursor:
            return getfirst(cursor, Sql.Count)

This method returns the number of cards in the database, and completes our implementation.

You might like to consider raising a custom InvalidCidError if a non-existent cid is passed to __getitem__(), __delitem__(), or __setitem__(). And more interestingly, you might like to add support for undo/redo, although doing so is quite a bit of work.

For those who don't much like writing raw SQL, an excellent alternative is to use an Object-Relational Mapper (ORM) such as SQLAlchemy or sqlobject.org's SQLObject. These make it possible to create Python classes such that a class maps to a database table and an instance to a database record with all the underlying SQL handled by the ORM.

For more see Python Programming Tips

Top