Honing the craft.

Building a Duplicity configuration profile manager

As mentioned in my earlier post on migrating Owncloud to Docker, I am in the process of moving all my services and development projects into Docker. To keep the setup resilient, I also have to reconfigure my automated backup system. So far, I have been using a custom Bash script to run the backups on a few *nix systems. Now, however, I am considering wrapping a Python program around Duplicity, my primary backup tool, to make my configuration easier to maintain.

We have been using Duplicity for the last few years for our backup needs. This covers work laptops and a microserver in the home office, as well as other laptops used by family members. Before that, I had been using macOS's Time Machine and other commercial software. However, I wanted remote, preferably off-site, backups for all systems, and these solutions either did not support everything I needed or did not do it the way I wanted, so I decided to look for something different. We only have one site, our home, so I was looking at cloud solutions for the off-site backup, which made encryption a priority.

After reviewing the available solutions for encrypted remote backups, preferably open source ones, it pretty much boiled down to two contenders: Bacula and Duplicity. Bacula is a feature-rich, well-rounded backup solution that can satisfy large, enterprise-like environments, especially thanks to its centralized management. However, this also means that it has various components that have to be managed, and it is a relatively complex system to maintain. It felt like overkill for our simple environment of fewer than 10 nodes. I looked into Duplicity and it had all the features I was looking for, namely:

  • Several remote protocols and cloud service providers are supported.
  • Backup archives are locally encrypted.
  • Support for incremental backups and deduplication. The original (full) backup is retained and later changes are "laid over" it, giving a complete chain up to a certain backup state.

Duplicity delivers all this in a simple command line tool, but this also means that there is no framework for managing the configurations of the various sources (hosts, paths) that you want to back up. Hence I created a "frontend" Bash script to run my backup jobs more conveniently. This tiny framework consists of a configuration file holding the environment variables that form the configuration, and a script that sources this file and runs the backup jobs accordingly. To have everything in one place, I lumped the configuration for all hosts into a single file and commented out the parts not relevant for the given host. Admittedly, this is not a very clean solution and it doesn't look nice, but it helped to keep the script itself very simple and within the boundaries of what a shell script is best suited for. As the Google Shell Style Guide prescribes:

"If you are writing a script that is more than 100 lines long, you should probably be writing it in Python instead."

I would probably put the limit more around 200+ lines, but you get the idea.

Changes required to accommodate the containers

Moving things into containers on my microserver diversifies the kinds of backup I need to take and keep tabs on. Where I previously only had to back up a few paths on my microserver, I now have at least two kinds of backups on my hands:

  • The same "native" local paths that are backed up in the host OS environment.
  • Docker volumes that have to be mounted into a container, with the backup run inside that container.
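
To illustrate the second kind, here is a minimal Python sketch that only builds the `docker run` command line for backing up one named volume. The image name `my-duplicity-image` and the volume name `owncloud_data` are hypothetical, and the image is assumed to have Duplicity installed. With no explicit action, Duplicity has historically decided between a full and an incremental run on its own, so none is given here. A remote target would also need its credentials (for example SSH keys) made available inside the container, which this sketch leaves out.

```python
import shlex
from typing import List, Optional


def docker_backup_command(volume: str, target_url: str,
                          image: str = "my-duplicity-image",
                          options: Optional[List[str]] = None) -> List[str]:
    """Build a `docker run` command that backs up one named Docker volume.

    `image` is a hypothetical image name, assumed to ship Duplicity.
    """
    mount_point = "/backup-source"
    return [
        "docker", "run", "--rm",
        # Mount the volume read-only so the backup job cannot alter the data.
        "-v", f"{volume}:{mount_point}:ro",
        image,
        # Duplicity options come first, then the source and the target URL.
        "duplicity", *(options or []), mount_point, target_url,
    ]


cmd = docker_backup_command(
    "owncloud_data",  # hypothetical volume name
    "scp://myscpuser@host.example.com/owncloud_data",
    options=["--no-encryption"],
)
print(shlex.join(cmd))
```

Returning a list instead of a string means the command can be handed to `subprocess.run` without going through a shell, so paths with spaces need no extra quoting.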

While the first kind is essentially the same as before, for the second I need a management script and potentially even a custom Docker image to handle the backup tasks for the Docker volumes efficiently. This means that I either extend my existing script with additional backup tasks, and maybe even adapt it to the container environment, or find a different solution. Keeping track of the configuration sets already feels like a pain in the neck, so I definitely don't want to stretch the existing script and configuration further. Hence, just as I did when I started to use Duplicity, I looked at the "frontend" options for Duplicity again to make managing the backup configurations easier. Here are the pros and cons of the most popular ones from my perspective:

Extending my own (Bash) shell script framework:

  • Pros
    • Full customization
    • Flexible
  • Cons
    • Configuration and code have to be mixed to keep the script simple.
    • I have to handle multiple configuration sets for different systems. If a master configuration is written to handle multiple profiles, I need to uncomment the sections relevant for the given backup. Otherwise, making the script intelligent enough to select the backup configuration itself would make the code base too big for a shell script.

Duply

  • Pros
    • The solution is ready.
    • Reduces the complexity of the backup command to run.
  • Cons
    • Not flexible enough to configure all the settings I need in the profile.
    • Maintaining one profile for each path to be backed up isn't convenient.

Déjà Dup

  • Pros
    • It has a nice and simple configuration GUI.
    • The same backup configuration can be used for multiple backup paths.
  • Cons
    • Its philosophy is simplicity, not flexibility, so it cannot handle multiple backup profiles.
    • GUI oriented: running and configuring it from the CLI is possible, but is not the main focus (not so good for headless servers).

Since none of the existing solutions that I have reviewed for Duplicity can handle multiple backup profiles, with each profile handling multiple paths, it looks like I have two choices: either extend my current script to do what I want, or implement the solution I need in a language better suited to larger code bases than a shell script is.

The solution I am working on

Based on my findings above, I have decided to create a frontend for Duplicity written in Python that solves my problem, but is also general enough to be released and used by other people looking for a similar feature set. The main focus of my solution would be the following:

  • Backup groups: lump together backup sources that belong together (same host, container, etc.) and have one general configuration for each backup source in the group.
  • Additional configuration options: make it possible to configure additional properties, like volume sizes, for each group.
  • CLI oriented configuration (YAML config file)
  • The fact that it is written in Python would allow easier extension or deeper integration with Duplicity if the need arises in the future.
  • Features will be released solely based on my needs, until those are covered. Then, if there are other people interested, I am planning to implement features requested by the community as well, as time permits.
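
As a rough illustration of the backup group idea, the following sketch models a group as one shared settings object plus a set of sources, and expands it into one job per source. All class and field names here are my own placeholders, not a finished design, and the volume size is assumed to be in megabytes, as with Duplicity's `--volsize` option.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, Optional, Tuple


@dataclass
class GroupSettings:
    """Settings shared by every backup source in one group (illustrative names)."""
    backup_type: str
    encrypted: bool = True
    volume_size: Optional[int] = None  # assumed to be megabytes


@dataclass
class BackupGroup:
    name: str
    settings: GroupSettings
    # Maps each source path to the location its backup is written to.
    sources: Dict[str, str] = field(default_factory=dict)

    def jobs(self) -> Iterator[Tuple[str, str, GroupSettings]]:
        """Yield one (source, destination, settings) job per source in the group."""
        for source, destination in self.sources.items():
            yield source, destination, self.settings


group = BackupGroup(
    name="microserver",
    settings=GroupSettings(backup_type="local", encrypted=False, volume_size=200),
    sources={
        "/var/www/html": "/root/backups/var/www/html",
        "/home/tommy": "/root/backups/home/tommy",
    },
)
for source, destination, settings in group.jobs():
    print(source, "->", destination, settings)
```

The point of the split is that a setting is written once per group and every source in that group picks it up, instead of being repeated in one profile per path.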

The data (configuration) structure determines and reveals most of the philosophy and logic, so I have started to create a draft of the configuration structure:

backup_groups:
  my_local_backups:
    encrypted: no
    backup_type: local
    volume_size: 200
    origins:
      /var/www/html:
        backup_path: /root/backups/var/www/html
        restore_path: /root/restored/var/www/html
      /home/tommy:
        backup_path: /root/backups/home/tommy
        restore_path: /root/restored/home/tommy
  my_s3_backups:
    encrypted: yes
    backup_type: s3
    backup_uri: s3://s3.sa-east-1.amazonaws.com/my-backup-bucket
    aws_access_key: xxxxxx
    aws_secret_key: xxxxxx
    gpg_key: xxxxxx
    gpg_passphrase: xxxxxx
    volume_size: 50
    origins:
      /etc:
        backup_path: /etc
        restore_path: /root/restored/etc
      /home/shared:
        backup_path: /home/shared
        restore_path: /root/restored/home/shared
  my_scp_backups:
    encrypted: no
    backup_type: scp
    backup_uri: scp://myscpuser@host.example.com/
    volume_size: 200
    origins:
      /home/fun:
        backup_path: /home/fun
        restore_path: /root/restored/home/fun
      /home/katy:
        backup_path: /home/katy
        restore_path: /root/restored/home/katy

For now, the keys, passphrases and passwords are included in the configuration file. However, it is on my list to come up with a generalized concept for integrating the configuration with Python Keyring in the simplest way possible. My policy is that no secrets should be stored in plain text files unless absolutely necessary, so this will only stay like this until I get the basic features in place.
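
One possible shape for that integration, sketched under assumptions: look each secret up in the system keyring first and fall back to an environment variable. The service name `duplicity-profiles` and the `group/secret` naming scheme are made up for this example. `keyring.get_password` is the lookup call of the Python `keyring` package, and `PASSPHRASE` is the environment variable Duplicity reads the GPG passphrase from.

```python
import os
from typing import Dict, Optional

try:
    import keyring  # third-party package; optional in this sketch
    import keyring.errors
except ImportError:
    keyring = None


def resolve_secret(service: str, name: str, env_var: str) -> Optional[str]:
    """Return a secret from the system keyring, else from the environment."""
    if keyring is not None:
        try:
            value = keyring.get_password(service, name)
        except (keyring.errors.KeyringError, RuntimeError):
            # No usable keyring backend (for example on a headless server).
            value = None
        if value:
            return value
    return os.environ.get(env_var)


def duplicity_env(group_name: str) -> Dict[str, str]:
    """Build the environment for a Duplicity run of one backup group."""
    env = dict(os.environ)
    passphrase = resolve_secret(
        "duplicity-profiles",             # hypothetical service name
        f"{group_name}/gpg_passphrase",   # hypothetical naming scheme
        "PASSPHRASE",
    )
    if passphrase is not None:
        env["PASSPHRASE"] = passphrase
    return env
```

The resulting dictionary could be passed as the `env` argument of `subprocess.run`, so the passphrase never has to appear in the configuration file or on the command line.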

What are your thoughts on this? Am I reinventing the wheel somehow? Is there an open source solution out there that I have missed? Please reach out to me via Twitter or email with any thoughts; they would be greatly appreciated. I will also reach out to the Duplicity team to see if I can get some input from their side on any pitfalls in my approach, or ideas to improve the concept. I will share the feedback with you if I get any.