I like the way dvc (Archived) allows you to manage the reproducibility of pipelines, but for my PhD I found myself multiple times in the need to execute a single pipeline multiple times with different parameters and to collect data in an homogeneous way during the multiple runs, so I ended up abusing dvc a bit for my goals.

Raising the issue with a friend, he told me that he uses an SQLite (Archived) database stored under git and an ORM (Archived) to write records in it. I liked the solution, except for the part that makes you store binary data inside of git.

So I searched for a way to make the databases behave like plain text files under git, but to be fully functional on checkout. It turns out that it is possible, as well as easy to do, and you also enjoy line-wise diffs of your databases!

The trick is the usage of git filters (Archived) to operate on the files when they are checked out and when they are added to the index. To set this up, open your .gitconfig (either under your home directory or in the root of one repository1) and add the following filter section, which declares a mandatory filter that dumps and recreates an SQLite database2.

[filter "sqlite3"]
    clean = "f() { tmpfile=$(mktemp); cat - > $tmpfile; sqlite3 $tmpfile .dump; rm $tmpfile; }; f"
    smudge = "f() { tmpfile=$(mktemp); sqlite3 $tmpfile; cat $tmpfile; rm $tmpfile; }; f"
    required = true

Now in your repository .gitattributes you can set the files extensions which will undergo treatment by the declared filter.

*.db filter=sqlite3
*.sqlite3 filter=sqlite3

These will also take care of displaying the SQLite dump when doing a diff, so you don't need to add a diff filter to the attributes.

If you used the .gitconfig in the repository root, you also have to instruct git to include that file in its configuration:

git config --local include.path ../.gitconfig

If you want to set up automatic .gitconfig loading for all of your repositories then you can change --local with --global, but be aware that there are security risks involved, since cloning any repository may modify your git configuration.

Finally, if someone without the filters set up clones the repository, he will get an error about a missing filter. He should then execute the git config command above to install the configuration and then do a new checkout of the repository, which will apply the newly installed filters.

git stash save -u
rm .git/index
git checkout HEAD -- "$(git rev-parse --show-toplevel)"
git stash pop

Notice that you need to explicitly delete the git index to do this.

With all of this set up, you can enjoy the flexibility of storing data into SQLite with all the comfort of it working seamlessly with version control.

Footnotes

  1. If you change the .gitconfig in your home directory, each person with which you collaborate has to copy it. On the other hand, if you change that in the repository root, each of them has to instruct their git configuration to load .gitconfig files found in certain repositories, but they can enjoy automatic updates when someone else adds functionality to that file.

  2. Notice the creation of a temporary file, since git filters are required to read the file from stdin and output the result on stdout, and sqlite3 can't do this, so we first have to save the input somewhere.