Objective/Short Description

The objective is to create a solution that bypasses some of the problems with the current Amanda setup, starting with the disk space problems currently present on tridge. In the 20130610 ops meeting it was proposed that the NFS/iSCSI shares on the Netapps could be used to solve that problem, but it was quickly pointed out that both NFS and iSCSI communications are unencrypted. There are also possible concerns about the backups being stored unencrypted on the end disks. We could use encryption either at block level (iSCSI) or at filesystem level (eCryptfs) to address these concerns. However, that would introduce problems of its own: encryption key handling, information leaks (filesystem names in the eCryptfs case) and the possible loss of all encrypted data, since the backup server is a SPOF. Given the specific problem at hand, all of these can be avoided. I therefore proposed that we use Bacula, which has built-in encryption for both communications and storage, no information leaking, and support for a master key that allows decryption of the encrypted data.

Generics

Bacula Architecture

The following PNG probably illustrates the Bacula architecture better than words:

http://www.bacula.org/2.4.x-manuals/en/main/bacula-applications.png

A couple of notes:

There is one Director Daemon only.

There may be multiple Storage Daemons (or SD for short) (for example one per datacenter)

There is going to be one File Daemon (or FD for short) per machine to be backed up.

All communications (indicated by arrows in the PNG) can be encrypted.

There are passwords that authenticate each party to all the others. TLS/SSL can be used in addition.

The data store can be tapes, files, DVDs or diskettes. All are called Volumes. The specifics of each medium are abstracted away by Bacula in day-to-day operations.

The SQL server stores the catalog. It is the first place where information should be sought when needed. However, it is not the primary source of information. Depending on the case, that resides in the Volumes, the configuration files and the bootstrap files [1].

Below I try to explain the various Bacula concepts very briefly.

Jobs

Jobs are the essential unit of activity in Bacula. Whatever Bacula does is a job. Whether it backs up, restores, verifies a backup or just moves things around in its volumes/pools, the activity is defined as a Job. Jobs are quite flexible: they can run arbitrary commands before and after a backup, and they support file-level deduplication, verification of backups, multiple storage destinations and pools.

Jobdefs

Since jobs have far too many attributes that can be defined, JobDefs (short for job defaults) store all the standard attributes that don't change between jobs, which keeps job definitions short.
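As an illustration of how JobDefs keep job definitions short, here is a minimal sketch in Bacula Director configuration. All resource names in it (DefaultJob, example-backup, example-fd, myfs, mysched, File, production) are placeholders chosen for the example and are not necessarily what our setup uses.

```
# Shared defaults. Every name below is a hypothetical placeholder.
JobDefs {
  Name = "DefaultJob"
  Type = Backup
  Level = Incremental
  Schedule = "mysched"
  Storage = File
  Pool = production
  Messages = Standard
}

# A job only states what differs from the defaults it inherits.
Job {
  Name = "example-backup"
  JobDefs = "DefaultJob"
  Client = example-fd
  FileSet = "myfs"
}
```

Any attribute set directly in the Job resource takes precedence over the value inherited from the JobDefs.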

Levels

Backup levels are:

Full (back up everything specified)
Differential (back up the changes since the previous Full)
Incremental (back up the changes since the previous Full, Differential or Incremental)

Schedules

A schedule defines when a job will take place. It supports various formats for defining the "when". It can also override some of a Job's defined attributes. This is heavily used for defining the levels easily and in an understandable way. For example:

   Schedule {
      Name = "mysched"
      Run = Level=Full 1st Sat at 06:00
      Run = Level=Differential 3rd Sat at 06:00
      Run = Level=Incremental sun-fri at 07:00
   }

Filesets

These define what should be backed up and what should not. They work by including a directory (or file) and recursing under it, backing up everything. Exclusions are possible, either by name or by wildcard, regex etc. By default filesets do not span filesystems, in order to avoid backing up filesystems like sysfs or proc, but this can be turned off (provided you know what you are doing). Sparse files are supported, as are whole block devices.
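A minimal sketch of a FileSet that uses the features described above. The name myfs and the paths are placeholders for illustration only. Options blocks are evaluated in order, so the block carrying the exclusion pattern comes before the catch-all block.

```
# Director configuration sketch. Name and paths are hypothetical.
FileSet {
  Name = "myfs"
  Include {
    # Files whose name matches the pattern are skipped.
    Options {
      wildfile = "*.tmp"
      exclude = yes
    }
    # Everything else is backed up with these settings.
    Options {
      signature = MD5
      # Do not cross filesystem boundaries (this is the default).
      onefs = yes
      sparse = yes
    }
    File = /a/backup
  }
  # Exclusion by exact name.
  Exclude {
    File = /a/backup/cache
  }
}
```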

Volumes

Volumes are what the data gets stored in. They are mostly an abstraction layer that hides device-specific behaviour from the other components. A volume can be a tape or a file, but also a DVD, a diskette or even a FIFO. Volumes have unique IDs called labels. A volume can be labelled either manually or, preferably, automatically, either through an autochanger (in the case of tape libraries) or internally by Bacula.
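For file-backed volumes, automatic labelling by Bacula is enabled on the Storage Daemon side. The following is a sketch of a Device resource under that assumption. The device name and the directory are placeholders, not our actual values.

```
# Storage Daemon configuration sketch. Name and path are hypothetical.
Device {
  Name = FileStorage
  Media Type = File
  # Directory in which the volume files are created.
  Archive Device = /srv/bacula
  # Allow Bacula to label new volumes by itself.
  LabelMedia = yes
  Random Access = yes
  AutomaticMount = yes
  RemovableMedia = no
  AlwaysOpen = no
}
```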

Pools

Pools are just aggregates of volumes. They exist mostly so that jobs can span more than one volume (a very useful feature). They are the destination point for backups, hiding the volume specifics from the rest of the configuration. All volumes in a pool must be of the same type.
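A minimal sketch of a Pool resource in the Director configuration. The pool name, size limit and retention period are illustrative assumptions and should not be read as the values we actually use.

```
# Director configuration sketch. All values are hypothetical.
Pool {
  Name = production
  Pool Type = Backup
  # Prefix used when volumes are labelled automatically.
  Label Format = "production"
  # Start a new volume once the current one reaches this size.
  Maximum Volume Bytes = 50G
  Volume Retention = 90 days
  AutoPrune = yes
  Recycle = yes
}
```

A Job (or a Schedule override) then refers to the pool by name, without needing to know anything about the individual volumes in it.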

Encryption

There are a number of communication channels in a standard Bacula setup, as shown in the architecture diagram. Each of them can be configured to be encrypted independently of the others. Please note that we are talking about communications here and not storage, so this is about encryption of the TCP connection (yes, that means SSL/TLS). The channels are:

Control channels. All paths in the architecture diagram starting from the Director or going to the Director are control channels. The main reason these should be encrypted is to avoid leaking the username/password used by the director to authenticate itself to the other daemons, since if these leak, impersonation of the director becomes possible (and relatively easy). Control channels also carry the file metadata of the client (the backed-up server), and that should be protected as well.
Data channels. Paths for the communication between the Storage Daemon and the File Daemon. These contain the actual data. There is no need to explain why they should be encrypted (there is, however, an argument for not encrypting them; see below).
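As an illustration of encrypting a control channel, here is a sketch of the Director resource in a File Daemon configuration with TLS turned on. The director name, password and certificate paths are placeholders invented for the example.

```
# File Daemon configuration sketch. Names, password and paths are hypothetical.
Director {
  Name = example-dir
  Password = "CHANGE-ME"
  # Refuse connections from the director that do not use TLS.
  TLS Enable = yes
  TLS Require = yes
  TLS CA Certificate File = /etc/bacula/ssl/ca.pem
  TLS Certificate = /etc/bacula/ssl/example-fd.crt
  TLS Key = /etc/bacula/ssl/example-fd.key
}
```

The matching Client resource on the director side needs the corresponding TLS directives as well, otherwise the connection is refused.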

Furthermore, File Daemons can be configured to send their data encrypted to the Storage Daemon. In that case the actual data never leaves the client unencrypted and is stored encrypted on the end medium (tape, disk, DVD or diskette). The data path can then already be considered encrypted, so another layer of encryption at the communications layer is quite possibly unnecessary (TODO: confirm this). The data is encrypted using the public key from the client's SSL certificate and can only be decrypted with the corresponding private key or with a Master key.
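Data encryption on the client is configured in the File Daemon itself. The following is a sketch assuming a per-client keypair and a shared master certificate. The daemon name and the file paths are placeholders.

```
# File Daemon configuration sketch. Name and paths are hypothetical.
FileDaemon {
  Name = example-fd
  PKI Signatures = Yes
  PKI Encryption = Yes
  # PEM file holding this client's certificate and private key.
  PKI Keypair = "/etc/bacula/ssl/example-fd.pem"
  # Public certificate of the master key. The master private key
  # should be kept off the client.
  PKI Master Key = "/etc/bacula/ssl/master.cert"
}
```

With this in place, a restore needs either the client's keypair or the master private key, so losing both makes the backups unreadable.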

SPOFs

The following documents the various places where problems might occur:

The director. Indeed a SPOF. Multiple directors are not supported at this point, and the director's hostname is used as the username in control channels. If the director fails, no backups or restores are possible. Installing a new director is, however, relatively easy.
The catalog. A standard MySQL server. We could have a hot-standby slave to avoid a SPOF. Backups running during a failover will fail.
The storage daemon. Multiple storage daemons can exist, although they do different jobs. The failure of a storage daemon will cause all backups and restores associated with that daemon to fail. The same hostname/password issue as with the director applies. Installing a new storage daemon is, however, relatively easy.
The data store. NAS, tape library, DVD/CD burner etc. A major SPOF from a hardware perspective. Bacula cannot do anything about it, but since we will rely on the Netapps for the data store we will use their HA.

WMF specifics

This is WIP as of 20130627

Proposed Architecture

A proposed solution is to use a server in EQIAD as both director and storage daemon. We then allocate and NFS-export one or more volumes from the Netapps for the data backend. The fact that the data will already be encrypted before even reaching the storage daemon means that we should have no problem with the unencrypted NFS channel. As a bonus, we will never need to wipe at least those specific disks in the Netapp. The clients should also use encrypted control channels for communication with the director daemon and the storage daemon. Since everything on the data channel will already be encrypted, we should avoid encrypting it a second time.

Off-site backups

Off-site backups are created by using Netapp's snapmirror for sending data to the other DC easily. We already have the snapmirror license and this solution works. Filesystems at the backup Netapp are read-only.

What to backup

For now just mirror the already in place backup. Revisit the issue later, probably on a case by case basis?

DB Backups

After a lot of talks with Asher and Sean we have ended up with a scheme that uses Percona's xtrabackup together with pigz to dump the entire InnoDB tablespace, compress it and pipe it to Bacula. Restoration is going to be more difficult, since the backup needs to be "prepared" (in xtrabackup parlance) and the service restarted.

Configuration Management

Everything must be done via puppet. There is a puppet module for this and role classes for director and storage daemon.

Adding a new client

In the director (if needed)

Edit the role::backup::director class and add:

bacula::director::fileset { 'myfs':
   includes => [ '/a/backup' ],
}

The above may very well already be there because another server uses the same fileset. The fileset name (myfs here) should be noted though, because it will be used below. It should not contain forward slashes or backslashes.

In the client

class { 'backup::host':
   sets => ['myfs',]
}

Backup Strategy

Two file-backed pools with autocreated, autolabeled volumes. The first one (production) stores all levels. The second one is an archival pool that exists for historical purposes.

Operations

Handy cheatsheet: https://workaround.org/bacula-cheatsheet

Day to day

Nothing.

Monitoring/Statistics

To be created

Restore (aka Panic mode)

ssh to helium and:

bconsole
restore
Select the desired case from the menu (most often 5: Most recent backup for a client)
Select the server
Choose the FileSet to be restored
Use the new prompt to browse the bvfs (Bacula virtual filesystem), provided the file metadata has not expired from the database. The standard ls and cd commands apply. If you specified a date that is old enough, you will not be able to browse and will have to restore the entire FileSet
Use the "mark" command to mark the files/dirs you want restored. Wildcards work, and there is also "unmark"
Enter done
Modify the job if needed (for example, change the destination directory)
Wait :-)
Fetch your restored files from /var/tmp/bacula-restores (on the client)

Bare metal recovery

There is a paid plugin by Bacula Systems that allows bare metal recovery. However, doing it manually is also relatively easy and quite straightforward as a procedure. It is roughly described below:

Boot with your Rescue Live CDROM.
Start the Network.
Re-partition your hard disk(s) as they were before (we are going to be dumping the partition tables via sfdisk, maybe?)
Re-format your partitions
Install bacula-fd
Perform a Bacula restore of all your files
Re-install your boot loader
Reboot