Forgotten art of file-based "databases"

I bet that whenever you think of a simple database, you think of sqlite. There’s nothing wrong with that, but the filesystem itself can be an even simpler substitute.

Simple tools, simple solutions

I often write simple shell scripts and command line tools which just complement my daily workflow. These aren’t anything fancy. They’re not intended to scale, cater to a large user base or anything like that. Just single-purpose tools that I want to have up and running quickly so I can focus on something else. Once written, very few changes follow; they mostly stay the same, get replaced by something else or get scrapped entirely.

So, if I need a database, why not just use sqlite? I often do, but it’s not always convenient. It’s the data model that stands in the way. I prefer my small utilities to be self-contained; ideally, single-file shell scripts. There’s always a problem with migrations, a good place to store the SQL schema definitions and things like that. Ultimately, if the tool requires maintenance, I know I’m not gonna use it.

I tend to resort to a simpler solution most of the time, namely the file system itself: a giant database that we use all the time. Analogous to an SQL database, directories can be your tables and the files themselves can be rows in these tables.

A database with a notes table can be modelled like so:

db/                   # database name
└── notes             # table name
    ├── 100.txt       # table row, filename as primary key
    ├── 101.txt       # ...
    ├── 102.txt
    ├── 103.txt
    ...
    ├── 118.txt
    ├── 119.txt
    ...

UUIDs can be used as auto-generated filenames (acting as primary keys) but in simple cases I often just resort to timestamps.

A directory with a set of files is equivalent to a table. The files are key-value stores with the filename acting as a primary key and the contents as the value.
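
As a rough sketch of what working with this layout looks like (the db/notes path and the timestamp-based key below are purely illustrative), inserting and reading a row boils down to writing and reading files:

#!/bin/bash
# Illustrative sketch only: "insert" and "select" in the file-per-row layout

declare -r TABLE="db/notes"
mkdir -p "${TABLE}"

# INSERT: the filename is the primary key, the file contents are the value
key="$(date +%s)"   # timestamp as a simple auto-generated key
echo "Remember to water the plants" >"${TABLE}/${key}.txt"

# SELECT by primary key
cat "${TABLE}/${key}.txt"

# SELECT * - iterate over all rows
for row in "${TABLE}"/*.txt; do
    printf '%s: %s\n' "$(basename "${row}" .txt)" "$(cat "${row}")"
done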

More columns

Okay, but what if you need more fields? As usual, the problem can be solved by adding another level of indirection - one more level in the directory tree in this case. Here’s an example:

db/
└── notes
    ├── 0774a890-81aa-4c7f-a68d-2286d8fdf0c9
    │   ├── body.txt
    │   ├── summary.txt
    │   └── title.txt
    ├── 0e839d09-33fe-4858-9b7f-7580bc29a296
    │   ├── body.txt
    │   ├── summary.txt
    │   └── title.txt
    ├── 16eb2f77-fbc4-48d2-991b-3b186d878c24
    │   ├── body.txt
    │   ├── summary.txt
    │   └── title.txt
    ├── 363b881c-288c-4eae-903a-500f5b8a9d92
    │   ├── body.txt
    │   ├── summary.txt
    │   └── title.txt
    ├── 424db2c8-3318-478f-916c-7766638addb1
    │   ├── body.txt
    │   ├── summary.txt
    │   └── title.txt
    ...

The row becomes a directory within notes. Its name is the primary key. The files in this directory are the columns. Each file name is a column name and the contents hold the value of that column for the given row.

In the example above, a single note has four fields: the uuid (the directory name) acting as the primary key, body containing the note’s main content, summary containing the note’s summary, and title holding the note’s title.
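
To make the analogy concrete, here’s a rough sketch of what column access could look like with this layout (the values and paths are made up for illustration):

#!/bin/bash
# Illustrative sketch of column access in the directory-per-row layout

declare -r TABLE="db/notes"

# INSERT: a row is a directory keyed by a fresh UUID, one file per column
row_id="$(uuidgen)"
mkdir -p "${TABLE}/${row_id}"
echo "Shopping list"        >"${TABLE}/${row_id}/title.txt"
echo "Milk, eggs, coffee"   >"${TABLE}/${row_id}/body.txt"
echo "Groceries for Monday" >"${TABLE}/${row_id}/summary.txt"

# SELECT title FROM notes WHERE id = $row_id
cat "${TABLE}/${row_id}/title.txt"

# UPDATE notes SET title = ... WHERE id = $row_id
echo "Weekly shopping list" >"${TABLE}/${row_id}/title.txt"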

Of course this is only one approach and the data can be modelled in many different ways using this method. An alternative might be to store all data in a single file, structured as e.g. JSON or TOML.

Relations

Since we know how to model individual fields and we’ve got primary keys in each table, it’s possible to introduce relations as well.

Imagine that there’s another table containing users:

└── users
    ├── 193174a6-86e6-49d5-9fa7-0df86d9acdfd
    │   ├── email.txt
    │   └── name.txt
    ├── 30e9b540-0209-4f29-938d-283c6de44ba5
    │   ├── email.txt
    │   └── name.txt
    ├── 48c21553-fb3d-4c7e-8d05-474c49e99ecf
    │   ├── email.txt
    │   └── name.txt
    ...

Then, within notes it’s just a matter of modelling another field - user_id - containing the UUID of the user who owns the note:

notes
├── 0774a890-81aa-4c7f-a68d-2286d8fdf0c9
│   ├── body.txt
│   ├── summary.txt
│   ├── title.txt
│   └── user_id.txt
├── 0e839d09-33fe-4858-9b7f-7580bc29a296
│   ├── body.txt
│   ├── summary.txt
│   ├── title.txt
│   └── user_id.txt
├── 16eb2f77-fbc4-48d2-991b-3b186d878c24
│   ├── body.txt
│   ├── summary.txt
│   ├── title.txt
│   └── user_id.txt
...

Of course you won’t get automatic joins and all the features that a “real” database offers. I’d argue though, that if your problem is complex enough that it requires joins across multiple tables and, in general, operates on an elaborate data model, then you’ve probably chosen the wrong tools for the job.
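
A manual “join” is still just a couple of file reads though. Here’s a rough sketch, assuming the notes and users layouts from above:

#!/bin/bash
# Illustrative sketch: resolve each note's owner by following user_id manually

declare -r DB="db"

for note in "${DB}/notes"/*/; do
    user_id="$(cat "${note}user_id.txt")"
    user_name="$(cat "${DB}/users/${user_id}/name.txt")"
    printf '%s (by %s)\n' "$(cat "${note}title.txt")" "${user_name}"
done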

I bet that none of this is performant enough to be used at any kind of scale but then again, it was never meant to be (and I’ve never bothered to do any benchmarks myself).

Locking

mkdir to the rescue

The simplest naive approach may involve using mkdir as an atomic lock; mkdir creates a directory or returns an error exit status if it already exists. This can be implemented like so:

#!/bin/bash

declare -r DB_LOCATION="$HOME/memo_wip/memos"
declare -a TABLES=("memos")

declare -r DB_LOCK_PATH="$DB_LOCATION/lock"

# populated once lock has been successfully acquired
declare DB_LOCK_TOKEN=

take_lock() {
    mkdir "${DB_LOCK_PATH}" 2>/dev/null && DB_LOCK_TOKEN="$(date)"
}

release_lock() {
    if [ -n "${DB_LOCK_PATH}" ] && [ -n "${DB_LOCK_TOKEN}" ]; then
        rm -d "${DB_LOCK_PATH}"
    fi
}

cleanup() {
    release_lock
}

trap cleanup INT TERM EXIT

setup() {
    [ -d "$DB_LOCATION" ] || mkdir -p "$DB_LOCATION"
    for table in "${TABLES[@]}"; do
        mkdir -p "${DB_LOCATION}/${table}"
    done
}

fatal() {
    echo "$*"
    exit 1
}

main() {
    ...
}

setup
take_lock || fatal "Unable to lock database"

main "$@"

Considering the requirements, this will suffice in most cases. Of course, this approach has its downsides. It will suffer from stale leftover locks in the case of SIGKILL - I usually just add a convenience command to handle the cleanup manually.
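
Such a cleanup command can be as trivial as removing the leftover lock directory. A sketch of what it might look like in the script above (the command name is made up):

# Hypothetical manual-cleanup command for a stale lock; the name is illustrative
cmd_unlock() {
    if [ -d "${DB_LOCK_PATH}" ]; then
        rm -d "${DB_LOCK_PATH}" && echo "Stale lock removed"
    else
        echo "No lock found"
    fi
}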

flock

This is a more robust alternative. It can be used as a wrapper for a single command:

flock -x -w <timeout> <lockfile> <command>

flock -x -w 5 lockfile sleep 1
flock -s -w 5 lockfile cat /some/file
flock -x -w 5 lockfile sh -c 'echo >/some/file'

The -x/-s flags model the semantics of exclusive/shared locks, which are basically the equivalent of POSIX write/read locks: you take a shared/read lock if you only need read access - it will be acquired as long as no writer already holds the lock, and more than one reader may hold it at the same time. You take an exclusive/write lock if you want to modify the data - it will only be acquired if no other locks are held, and while it’s held it blocks acquisition of any other locks.

In scripts, the second form is more useful:

(
    flock -x <file_descriptor>

    # The rest of your script...

    # The lock is released when the subshell exits
) <file_descriptor>>/lockfile/path

I usually structure my scripts to have a main function, so this works out perfectly (the file descriptor number is chosen completely arbitrarily):

#!/bin/bash

declare -r DB_LOCATION="$HOME/memo_wip/memos"
declare -a TABLES=("memos")

declare -r DB_LOCK_PATH="$DB_LOCATION/lockfile"
declare -r DB_LOCK_TIMEOUT=5

setup() {
    [ -d "$DB_LOCATION" ] || mkdir -p "$DB_LOCATION"
    for table in "${TABLES[@]}"; do
        mkdir -p "${DB_LOCATION}/${table}"
    done
}

fatal() {
    echo "$*"
    exit 1
}

main() {
    ...
}

(
    flock -x -w "${DB_LOCK_TIMEOUT}" 137 || fatal "Unable to lock database"
    setup
    main "$@"
) 137>"${DB_LOCK_PATH}"

The only downside (that I know of) with flock is that there are supposedly compatibility problems when it’s used on NFS file systems (both the client and the server may acquire an exclusive lock on the same file) - but I haven’t tested that myself.

Putting it all together

Imagine you’re creating a simple memo-taking tool in bash. Your memo is just a chunk of text and maybe a creation timestamp. Nothing fancy. It’ll support a couple of commands:

  • edit [UUID] - given a UUID it will open the existing memo in $EDITOR, otherwise it will create a new one
  • cat <UUID> - just dumps the memo to the terminal
  • grep PATTERN - will search for content within memos
  • list [ASC|DESC] [LIMIT] - will list memos

Let’s start with some basics:

#!/bin/bash

declare -r DB_LOCATION="$HOME/memos"
declare -a TABLES=("memos")

declare -r DB_LOCK_PATH="$DB_LOCATION/lockfile"
declare -r DB_LOCK_TIMEOUT=5

setup() {
    [ -d "$DB_LOCATION" ] || mkdir -p "$DB_LOCATION"
    for table in "${TABLES[@]}"; do
        mkdir -p "${DB_LOCATION}/${table}"
    done
}

fatal() {
    echo "$*"
    exit 1
}

help() {
    echo "memos <cmd>"
    echo "Commands:"
    for cmd in "${CMDS[@]}"; do
        echo -e "\t${cmd##${cmd_prefix}}"
    done
    exit 1
}

main() {
    local cmd="${1}"
    [ -z "$cmd" ] && help
    shift
}

(
    # file descriptor chosen arbitrarily
    flock -x -w "${DB_LOCK_TIMEOUT}" 137 || fatal "Unable to lock database"
    setup
    main "$@"
) 137>"${DB_LOCK_PATH}"

That’s the skeleton I usually create. I like to define commands as separate functions with special names. In this case, prefixed with cmd_memo_:

cmd_memo_cat() {
    : # TODO implement me
}

cmd_memo_edit() {
    : # TODO implement me
}

cmd_memo_grep() {
    : # TODO implement me
}

cmd_memo_list() {
    : # TODO implement me
}

declare -r cmd_prefix="cmd_memo_"

# deliberately allow splitting on space
declare -ar CMDS=($(compgen -A function "${cmd_prefix}" | xargs))

help() {
    echo "memos <cmd>"
    echo "Commands:"
    for cmd in "${CMDS[@]}"; do
        echo -e "\t${cmd##${cmd_prefix}}"
    done
    exit 1
}

main() {
    local cmd="${1}"
    [ -z "$cmd" ] && help
    shift

    local cmdfunc="${cmd_prefix}${cmd}"
    [ -n "$(compgen -A function "${cmdfunc}")" ] && $cmdfunc "$@"
}

This allows for automatic dispatch of a command to a particular function and gives a listing of available commands for free. I can now do:

memos.sh cat
memos.sh grep
memos.sh list
...

The implementation of each database command is now very simple. Let’s start with memo editing:

memo_make_filepath() {
    local memo_id="$1"
    local table="memos"
    echo "${DB_LOCATION}/${table}/${memo_id}"
}

memo_create() {
    local row_id="$(uuidgen)"
    local row_path="$(memo_make_filepath "${row_id}")"

    mkdir "$row_path"
    touch "$row_path"/body.txt
    date >"$row_path"/timestamp.txt

    echo "$row_id"
}

cmd_memo_edit() {
    local row_id="${1}"

    if [ -z "${row_id}" ]; then
        row_id="$(memo_create)"
    fi

    local memo_filepath="$(memo_make_filepath "${row_id}")"
    $EDITOR "$memo_filepath"/body.txt
}

Similarly, cat and grep are just a matter of path construction:

cmd_memo_cat() {
    local row_id="${1}"
    [ -z "${row_id}" ] && fatal "cat <memo_uuid>"

    local memo_filepath="$(memo_make_filepath "${row_id}")"

    echo "Created: $(cat "${memo_filepath}"/timestamp.txt)"
    cat "${memo_filepath}"/body.txt
}

cmd_memo_grep() {
    local table="memos"
    grep -nrs "$1" "${DB_LOCATION}/${table}/"*/body.txt
}

list is slightly more complex as it allows for limiting output and basic sorting:

cmd_memo_list() {
    local memo_dir="${DB_LOCATION}/memos"
    local order="${1:-DESC}"
    local list_limit="${2:-0}"

    # ls -t lists newest first (descending); reverse the output for ASC
    local order_stmt=
    if [ "${order}" = "ASC" ]; then
        order_stmt=" | tac"
    fi

    local limit_stmt=
    if [ "$list_limit" -gt 0 ]; then
        limit_stmt=" | head -n ${list_limit}"
    fi

    eval ls -1 -t "${memo_dir}" "$order_stmt" "$limit_stmt"
}

Why is this useful?

Continuing with the memo example, having data structured as a set of simple files allows more tooling to be built on top of this simple note-taking tool. Access is trivial, parsing is simple. Since the data is accessible from the terminal, it’s easy to transform it into other formats as well. This opens a lot of opportunities for further experimentation.
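
For instance, a quick export to JSON is just a loop over the row directories. A rough sketch, assuming jq is available and the $HOME/memos layout from above:

#!/bin/bash
# Illustrative sketch: dump every memo as a JSON array using jq

declare -r MEMO_DIR="$HOME/memos/memos"

for row in "${MEMO_DIR}"/*/; do
    jq -n \
        --arg id "$(basename "${row}")" \
        --arg created "$(cat "${row}timestamp.txt")" \
        --arg body "$(cat "${row}body.txt")" \
        '{id: $id, created: $created, body: $body}'
done | jq -s '.'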