Forgotten art of file-based "databases"

I bet that whenever you think of a simple database, you think of SQLite. There’s nothing wrong with that, but the filesystem itself can be an even simpler substitute.
Simple tools, simple solutions
I often write simple shell scripts and command line tools which just complement my daily workflow. These aren’t anything fancy. They’re not intended to scale, cater to a large user base or anything like that. Just single-purpose tools that I want to have up and running quickly so that I can focus on something else. Once written, very few changes follow; they mostly stay the same, get replaced by something else or get scrapped entirely.
So, if I need a database, why not just use SQLite? I often do, but it’s not always convenient. It’s the data model that stands in the way. I prefer my small utilities to be self-contained - ideally, single-file shell scripts. There’s always the problem of migrations, of finding a good place to store the SQL schema definitions, and things like that. Ultimately, if the tool requires maintenance, I know I’m not gonna use it.
I tend to resort to a simpler solution most of the time, namely the file system itself - a giant database that we use all the time. Analogous to an SQL database, directories can be your tables and the files themselves can be the rows in these tables.
A database with a notes table can be modelled like so:
```
db
└── notes
    ├── 1693145093
    └── 1693145127
```
UUIDs can be used for auto-generated filenames (acting as primary keys), but in simple cases I often just resort to timestamps.
A directory with a set of files is equivalent to a table. Together, the files form a key-value store, with the filename acting as the primary key and the contents as the value.
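As a quick sketch of the idea (the database path and the note contents below are made up for illustration), inserting and reading back a “row” is just file I/O:

```shell
#!/usr/bin/env bash
set -euo pipefail

# hypothetical database root - one directory per "table"
DB="./db"
mkdir -p "${DB}/notes"

# INSERT: the filename (a timestamp here) acts as the primary key,
# the file contents are the value
key="$(date +%s)"
echo "Remember to water the plants" > "${DB}/notes/${key}"

# SELECT by primary key: just read the file back
cat "${DB}/notes/${key}"
```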
More columns
Okay, but what if you need more fields? As usual, the problem can be solved by adding another level of indirection - in this case, one more level in the directory tree. Here’s an example:
```
db
└── notes
    └── 3b3bea24-4d90-4eba-b469-110d7a7112c9
        ├── body
        ├── summary
        └── title
```
The row becomes a directory within notes. Its name is the primary key. The files in this directory are the columns. Each file name is a column name and the contents model the value for the column in a given row.
In the case of the example above, a single note has four fields: uuid acting as the primary key, body containing the note’s main content, summary containing the note’s summary, and the note’s title in its respective file.
Of course, this is only one approach; the data can be modelled in many different ways using this method. An alternative might be to store all the data in a single file, structured as e.g. JSON or TOML.
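A sketch of that alternative - one structured file per note instead of a directory of field files (the filename and the fields here are made up):

```shell
#!/usr/bin/env bash
set -euo pipefail

DB="./db"
mkdir -p "${DB}/notes"

# the whole row lives in a single JSON document
cat > "${DB}/notes/3b3bea24.json" <<'EOF'
{"title": "Shopping", "summary": "Groceries", "body": "milk, eggs"}
EOF

# crude field access without a JSON parser; real code would
# probably reach for jq instead
grep -o '"title": "[^"]*"' "${DB}/notes/3b3bea24.json"
```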
Relations
Since we know how to model individual fields and we’ve got primary keys in each table, it’s possible to introduce relations as well.
Imagine that there’s another table containing users:
```
db
└── users
    └── 61a2bb12-6c01-47ba-b577-3ba2b66a0e04
        └── name
```
Then, within notes, it’s just a matter of modelling another field - user_id - containing the UUID of the user owning the note:
```
db
├── notes
│   └── 3b3bea24-4d90-4eba-b469-110d7a7112c9
│       ├── body
│       ├── summary
│       ├── title
│       └── user_id
└── users
    └── 61a2bb12-6c01-47ba-b577-3ba2b66a0e04
        └── name
```
Of course, you won’t get automatic joins and all the features that a “real” database offers. I’d argue, though, that if your problem is complex enough that it requires joins across multiple tables and, in general, operates on an elaborate data model, then you’ve probably chosen the wrong tool for the job.
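That said, a manual “join” is just a matter of following the foreign key by hand: read user_id from the note, then look the user up by its primary key. The data set below is made up for illustration:

```shell
#!/usr/bin/env bash
set -euo pipefail

# made-up rows: one user, one note pointing at that user
DB="./db"
mkdir -p "${DB}/users/61a2bb12" "${DB}/notes/3b3bea24"
echo "Alice"    > "${DB}/users/61a2bb12/name"
echo "Buy milk" > "${DB}/notes/3b3bea24/title"
echo "61a2bb12" > "${DB}/notes/3b3bea24/user_id"

# the "join": resolve each note's user_id against the users table
for note in "${DB}"/notes/*/; do
    user_id="$(cat "${note}user_id")"
    owner="$(cat "${DB}/users/${user_id}/name")"
    printf '%s (by %s)\n' "$(cat "${note}title")" "${owner}"
done
```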
I bet that none of this is performant enough to be used at any kind of scale but, then again, it was never meant to be to begin with (and I’ve never bothered to do any benchmarks myself).
Locking
mkdir to the rescue
The simplest, naive approach may involve using mkdir as an atomic lock; mkdir creates a directory or returns an error exit status if it already exists. This can be implemented like so:
```shell
# the lock path is an example - anything writable will do
LOCKDIR="/tmp/mytool.lock"

lock() {
    # mkdir is atomic: it either creates the directory
    # or fails because somebody else already holds the lock
    until mkdir "${LOCKDIR}" 2>/dev/null; do
        sleep 1
    done
}

unlock() {
    rmdir "${LOCKDIR}"
}
```
Considering the requirements, this will suffice in most cases. Of course, this approach has many downsides. It will suffer from stale leftover locks in case of SIGKILL - I usually just add a convenience command to handle the cleanup manually.
flock
This is a more robust alternative. It can be used as a wrapper for a single command:
```shell
flock -x -w <timeout> <lockfile> <command>
```

For example:

```shell
flock -x -w 5 lockfile sleep 1
flock -s -w 5 lockfile cat /some/file
flock -x -w 5 lockfile echo >/some/file
```
The -x/-s flags model the semantics of exclusive/shared locks, which are basically the equivalent of POSIX write/read locks. You take a shared (read) lock if you only need read access; it will be acquired as long as no writer is already holding the lock, and more than one reader may hold it at the same time. You take an exclusive (write) lock if you want to modify the data; it will be acquired only if no other locks are held, and while held it blocks the acquisition of any other locks.
In scripts, the second form is more useful:
```shell
# open the lock file on an arbitrary file descriptor,
# then lock that descriptor
exec 200>lockfile
flock -x -w 5 200 || exit 1

# ... critical section ...
```
I usually structure my scripts to have a main function, so this works out perfectly (the file descriptor number is chosen completely arbitrarily):
```shell
main() {
    flock -x -w 5 200 || exit 1

    # ... actual work goes here ...
}

# the redirection on the call opens the lock file
# on fd 200 for the duration of main
main "$@" 200>lockfile
```
The only downside (that I know of) with flock is that there are supposedly compatibility problems when it’s used on NFS file systems (both the client and the server may acquire an exclusive lock on the same file) - but I haven’t tested that myself.
Putting it all together
Imagine you’re creating a simple memo-taking tool in bash. Your memo is just a chunk of text and maybe a creation timestamp. Nothing fancy. It’ll support a couple of commands:
- `edit [UUID]` - given a UUID it will open an `$EDITOR` with an existing memo or create a new one
- `cat <UUID>` - just dumps the memo to the terminal
- `grep PATTERN` - will search for content within memos
- `list [ASC|DESC] [LIMIT]` - will list memos
Let’s start with some basics:
```shell
#!/usr/bin/env bash

set -euo pipefail

# all memos live in a single directory; each memo is one file
MEMOS_DIR="${MEMOS_DIR:-${HOME}/.memos}"

mkdir -p "${MEMOS_DIR}"
```
That’s the skeleton I usually create. I like to define commands as separate functions with special names - in this case, prefixed with cmd_memo_:
```shell
# each command is a function with a special prefix;
# the implementations follow below
cmd_memo_edit() { :; }
cmd_memo_cat()  { :; }
cmd_memo_grep() { :; }
cmd_memo_list() { :; }

main() {
    local cmd="${1:-}"
    shift || true

    if declare -f "cmd_memo_${cmd}" >/dev/null; then
        "cmd_memo_${cmd}" "$@"
    else
        echo "available commands:" >&2
        compgen -A function cmd_memo_ | sed 's/^cmd_memo_/  /' >&2
        exit 1
    fi
}

main "$@"
```
This allows for automatic dispatch of a command to a particular function and gives a listing of available commands for free. I can now do:
```shell
memos.sh cat
memos.sh grep
memos.sh list
...
```
The implementation of each database command is now very simple. Let’s start with memo editing:
```shell
cmd_memo_edit() {
    # no UUID given - generate one and create a new memo
    local uuid="${1:-$(uuidgen)}"

    "${EDITOR:-vi}" "${MEMOS_DIR}/${uuid}"
}
```
Similarly, cat and grep are just a matter of path construction:
```shell
cmd_memo_cat() {
    cat "${MEMOS_DIR}/${1}"
}

cmd_memo_grep() {
    grep -r "${1}" "${MEMOS_DIR}"
}
```
list is slightly more complex as it allows for limiting output and basic sorting:
```shell
cmd_memo_list() {
    local order="${1:-ASC}"
    local limit="${2:-0}"
    local flags=""

    # timestamp filenames sort chronologically as plain text
    if [ "${order}" = "DESC" ]; then
        flags="-r"
    fi

    if [ "${limit}" -gt 0 ]; then
        ls -1 "${MEMOS_DIR}" | sort ${flags} | head -n "${limit}"
    else
        ls -1 "${MEMOS_DIR}" | sort ${flags}
    fi
}
```
Why is this useful?
Continuing with the memo example, having data structured as a set of simple files allows more tooling to be built on top of this simple note-taking tool. Access is trivial, parsing is simple. Since the data is accessible from the terminal, it’s easy to transform it into other formats as well. This opens up a lot of opportunities for further experimentation.
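For instance, dumping the whole memo store as JSON takes only a few lines of shell. Everything below (paths, contents) is made up for illustration, and a real version would need to escape special characters in the memo bodies - jq would be a better fit for that:

```shell
#!/usr/bin/env bash
set -euo pipefail

# made-up memo store
DB="./memos"
mkdir -p "${DB}"
echo "first memo"  > "${DB}/1000"
echo "second memo" > "${DB}/1001"

# each file becomes one JSON object keyed by its filename
# (no escaping is done - illustration only)
printf '['
sep=''
for f in "${DB}"/*; do
    printf '%s{"id":"%s","body":"%s"}' "${sep}" "$(basename "${f}")" "$(cat "${f}")"
    sep=','
done
printf ']\n'
```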