Encrypting and auto-boot-decryption of an LXC zpool on Ubuntu with LUKS

Image result for luks key list

So we have seen some postings online that suggested you can’t encrypt an lxd zpool, such as this GitHub posting here, which correctly explains that an encrypted zpool that doesn’t mount at startup disappears WITHOUT WARNING from your lxd configuration.

It’s not the whole picture as it IS possible to so encrypt an lxd zpool with luks (the standard full disk encryption option for Linux Ubuntu) and have it work out-of-the-box at startup, but perhaps it’s not as straightforward as everyone would like

WARNING WARNING – THE INSTRUCTIONS BELOW ARE NOT GUARANTEED.  WE USE COMMANDS THAT WILL WIPE A DRIVE SO GREAT CARE IS NEEDED AND WE CANNOT HELP YOU IF YOU LOSE ACCESS TO YOUR DATA.  DO NOT TRY THIS ON A PRODUCTION SERVER.  SEEK PROFESSIONAL HELP INSTEAD, PLEASE!

With that said…this post is for those who, for example, have a new clean system that they can always do-over if this tutorial does not work as advertised.  Ubuntu OS changes and so the instructions might not work on your particular system.

Firstly, we assume you have your ubuntu 16.04 installed on a luks encrypted drive (i.e. the standard ubuntu instal using the “encrypt HD” option).  This of course requires you to enter a password at boot-up to decrypt your system, something like:Image result for ubuntu full disk encryption

We assume you have a second drive that you want to use for your linux lxd containers.  That’s how we roll our lxd.

So, to setup an encrypted zpool, select your drive to be used (we assume it’s /dev/sdd here, and we assume it’s a newly created partition that is not yet formatted – your drive might be /dev/sda, /dev/sdb or something quite different – MAKE SURE YOU GET THAT RIGHT).

Go through the normal luks procedure to encrypt the drive:

sudo cryptsetup -y -v luksFormat /dev/sdd

Enter the password and NOTE THE WARNING – this WILL destroy the drive contents.  #YOUHAVEBEENWARNED

Then open it (CHANGE /dev/sdd and sdd_crypt to yout drive name credentials!):

sudo cryptsetup luksOpen /dev/sdd sdd_crypt

Normally, you would create your normal file system now, such as an ext4, but we don’t do that.  Instead, create your zpool (we are calling ours ‘lxdzpool’ – feel free to change that to ‘tank’ or whatever pool name you prefer):

sudo zpool create -f -o ashift=12 -O normalization=formD -O atime=off -m none -R /mnt -O compression=lz4 lxdzpool  /dev/mapper/sdd_crypt

And, believe it or not, there you have an encrypted zpool.  Add it to lxd using the standard ‘sudo lxd init’ or ‘sudo zpool create’ etc. procedure that you need to go through to create lxc containers, then start launching your containers and voila, you are using an encrypted zpool.

So, we are not done yet.  We can’t let the OS boot up without decrypting the zpool drive, lest our containers disappear and lxd goes back to using a directory for its zpool, per the GitHub posting referred to above.  That would not be good.  So how do we make sure this is auto-decrypted at boot-up (which is needed for lxc containers to launch)?

Well, we have to create a keyfile that is used to decrypt this drive after you decrypt the main OS drive (so you do still need to decrypt your PC at bootup as usual – as above).  Polite reminder: again, change /dev/sdd to your drive name credentials:

sudo dd if=/dev/urandom of=/root/.keyfile bs=1024 count=4
sudo chmod 0400 /root/.keyfile
sudo cryptsetup luksAddKey /dev/sdd /root/.keyfile

This creates  keyfile at /root/.keyfile.  This file is used to decrypt the zpool drive.  Just answer the prompts that these commands generate (self explanatory).

Now find out your disks UUID number with:

sudo blkid

This should give you a list of your drives with various information.  We need the long string that comes after “UUID=…” for your drive, e.g.:

/dev/sdd: UUID=”971bf7bc-43f2-4ce0-85aa-9c6437240ec5″ TYPE=”crypto_LUKS”

Note we need the UUID – not the PARTUUID or anything else.  It must say “UUID=…”.

Now edit /etc/crypttab as root:

sudo nano /etc/crypttab

And add an entry like this (another polite reminder: again, change sdd_crypt to your drive name credentials)

#Add entry to aut-unlock the encrypted drive at boot-up,
#after the main OS drive has been unlocked
sdd_crypt UUID=971bf7bc-43f2-4ce0-85aa-9c6437240ec5 /root/.keyfile luks,discard

And now reboot.  You should see your familiar boot-up screen for decrypting your ubuntu OS.  And once you enter the correct password, the encrypted zfs zpool drive will be automatically decrypted and will allow lxd to access it as your zpool.  Here’s an excerpt from our ‘lxc info’ output AFTER a reboot.  We highlighted the most important bit for this tutorial:

$ lxc info
config:
storage.zfs_pool_name: lxdzpool
api_extensions:
– id_map
– id_map_base
– resource_limits
api_status: stable
api_version: “1.0”
auth: trusted
auth_methods: []
public: false
driver: lxc
driver_version: 2.0.8
kernel: Linux
kernel_architecture: x86_64
kernel_version: 4.15.0-34-generic
server: lxd
storage: zfs

Note we are using our ‘lxdzpool’.

We hope this is useful.  GOOD LUCK!

Useful additional reference materials are here (or at least, they were here when we posted this article):

Encrypting a second hard drive on Ubuntu (post-install)

Setting up ZFS on LUKS

LXC Container Migration – WORKING

So we found a spare hour at a remote location and thought we could tinker a little more with lxc live migration as part of our LXD experiments.

Related image

We executed the following in a terminal as NON-ROOT users yet again:

lxc copy Nextcloud EI:Nextcloud-BAK-13-Sep-18

lxc start EI:Nextcloud-BAK-13-Sep-18

lxc list EI: | grep Nextcloud-BAK-Sep-13-Sep-18

And we got this at the terminal (a little time later…)

| Nextcloud-BAK-13-Sep-18 | RUNNING | 192.168.1.38 (eth0) | | PERSISTENT | 0 |

Note that this is a 138GB file.  Not small by any standard.  It holds every single file that’s important to our business (server-side AND end-to-end encrypted of course).  That’s a big file-copy.  So even at LAN speed, this gave us enough time to make some really good coffee!

So we then modified our front-end haproxy server to redirect traffic intended for our primary cloud-file server to this lxc instance instead. (Two minor changes to a config, replacing the IP address of the current cloud to the new cloud).  Then we restarted our proxy server and….sharp intake of breath…

IT WORKED BEAUTIFULLY!

Almost unbelievably, our entire public-facing cloud server was now running on another machine (just a few feet away as it happens).   We hoped for this, but we really did not expect a 138GB file to copy and startup first time.  #WOW

We need to test and work this instance to death to make sure it’s every bit as SOUND as our primary server, which is now back online and this backup version is just sleeping.

Note that this is a complete working copy of our entire cloud infrastructure – the Nextcloud software, every single file, all the HTTPS: certs, databases, configurations, OS – everything.  A user changes NOTHING to access this site, in fact, it’s not even possible for them to know it’s any different.

We think this is amazing, and is a great reflection of the abilities of lxc, which is why we are such big fans,

With this set-up, we could create working copies of our servers in another geo-location every, say, month, or maybe even every week (once a day is too much for a geo-remote facility – 138GB for this one server over the intenet?  Yikes).

So yes, bandwidth needed IS significant, and thus you can’t flash the larger server images over the internet every day, but it does provide for a very resistant disaster-recovery situation: if our premises go up in a Tornado, we can be back online with just a few clicks from a web-browser (change DNS settings and maybe a router setting or two) and issue a few commands from an ssh terminal, connected to the backup facility.

We will develop a proper, sensible strategy for using this technique after we have tested it extensively, but for now, we are happy it works.  It gives us another level of redundancy for our updating and backup processes.

GOTTA LOVE LXD

Image result for love LXD

Why LXC

We deploy LXC containers in quite literally ALL of our production services.  And we have several development services, ALL of which are in LXC containers.

Why is that?  Why do we use LXC?

The truthful answer is “because we are not Linux experts”.  It really is true.  Almost embarrassingly so in fact.  Truth is, the SUDO command scares us: it’s SO POWERFUL that you can even brick a device with it (we know.  We did it).

We have tried to use single machines to host services.  It takes very little resources to run a Linux server, and even today’s laptops have more than enough hardware to support even a mid-size business (and we are not even mid-size).  The problem we faced was that whenever we tried “sudo” commands in Linux Ubuntu, something at sometime would go wrong – and we were always left wondering if we had somehow created a security weakness, or some other deficiency.  Damn you, SuperUser, for giving us the ability to break a machine in so many ways.

We kept re-installing the Linux OS on the machines and re-trying until we were exhausted.  We just could not feel COMFORTABLE messing around with an OS that was already busy dealing with the pervasive malware and hacker threats, without us unwittingly screwing up the system in new and novel ways.

And that’s when the light went on.  We thought: what if we could type in commands without worrying about consequences?  A world where anything goes at the command line is…heaven…for those that don’t know everything there is to know about Linux (which of course definitely includes us!).  On that day, we (re-) discovered “virtual machines”.  And LXC is, in our view, the BEST, if you are running a linux server.

LXC allows us to create virtual machines that use fewer resources than the host; machines that run as fast as bare-metal servers (actually, we have measured them to be even FASTER!).  But more than that, LXC with its incredibly powerful “snapshot” capability allows us to play GOD at the command line, and not worry about the consequences.

Because of LXC, we explore new capabilities all the time – looking at this new opensource project, or that new capability.  And we ALWAYS ALWAYS run it in an unprivileged LXC container (even if we have to work at it) because we can then sleep at night.

We found the following blog INCREDIBLY USEFUL – it inspired us to use LXC, and it gives “us mortals” here at Exploinsights, Inc. more than enough information to get started and become courageous with a command line!  And in our case, we have never looked back.  We ONLY EVER use LXC for our production services:

LXC 1.0: Blog post series [0/10]

We thank #UBUNTU and we thank #Stéphane Graber for the excellent LXC and the excellent development/tutorials respectively.

If you have EVER struggled to use Linux.  If the command line with “sudo” scares you (as it really should).  If you want god-like forgiveness for your efforts to create linux-based services (which are BRILLIANT when done right) then do yourself a favor: check out LXC at the above links on a clean Ubuntu server install.  (And no, we don’t get paid to say that).

We use LXC to run our own Nextcloud server (a life saver in our industry).  We operate TWO web sites (each in their own container), a superb self-hosted OnlyOffice document server and a front-end web proxy that sends the traffic to the right place.  Every service is self-CONTAINED in an LXC container.  No worries!

Other forms of virtualisation are also good, probably.  But if you know of anything as FAST and as GOOD as LXC…then, well, we are surprised and delighted for you.

Regards,

SysAdmin (Administrator@exploinsights.com)

 

 

 

Interactively Updating LXC containers

We love our LXC containers.  They make it so easy to provide and update services – snapshots take most of the fear out of the process, as we have discussed previously here.  But even so, we are instinctively lazy and are always looking for ways to make updates EASIER.  Now it’s possible to fully automate the updating of a running service in an LXC container BUT a responsible administrator wants to know what’s going on when key applications are being updated.  We created a compromise, a simple script that runs an interactive process to backup and update our containers.  It saves us repetitively typing the same commands, but it still keeps us fully in control as we answer yes/no upgrade related questions.  We thought our script is worth sharing.  So, without further ado, here’s our script, which you can just copy and paste to a file in your home directory (called say ‘update-containers.sh’).  Then just run the script when you want to update and upgrade your containers.  Don’t forget to change the name(s) of your linux containers in the ‘names=…’ line of the script:

#!/bin/bash
#
# Simple Container Update Script
# Interactively update lxc containers

# Change THESE ENTRIES with container names and remote path:
#
names='container-name-1 c-name2 c-name3 name4 nameN'

# Now we just run a loop until all containers are backed up & updated
#
for name in $names
do
echo ""
echo "Creating a Snapshot of container:" $name
lxc snapshot $name
echo "Updating container:" $name
lxc exec $name apt update
lxc exec $name apt upgrade
lxc exec $name apt autoremove
echo "Container updated. Re-starting..."
lxc restart $name
echo ""
done
echo "All containers updated"

Also, after you save it, don’t forget to chmod the file if you run it as a regular script:

chmod +x container-update.sh

Now run the script:

./container-update

Note – no need to run using ‘sudo’ i.e. as ROOT user- this is LXC, we like to be run with minimal privileges so as not to ever break anything important!

So this simple script, which runs in Ubuntu or equivalex distro, does the following INTERACTIVELY for every container you name:

lxc snapshot container #Make a full backup, in case the update fails
apt update             #Update the repositories
apt upgrade            #Upgrade everything possible
apt autoremove         #Free up space by deleting old files 
restart container      #Make changes take effect

This process is repeated for every container that is named.  The ‘lxc snapshot’ is very useful: sometimes an ‘apt upgrade’ breaks the system.  In our case, we can then EASILY restore the container to its prior updated state using the ‘lxc restore command.  All you have to do is firstly find out a containers snapshot name:

lxc info container-name

E.g. – here’s the output of ‘lxc info’ on one of our real live containers:

sysadmin@server1:~lxc info office

Name: office
Remote: unix://
Architecture: x86_64
Created: 2018/07/24 07:02 UTC
Status: Running
Type: persistent
Profiles: default
Pid: 21139
Ips:
eth0: inet 192.168.1.28
eth0: inet6 fe80::216:3eff:feab:4453
lo: inet 127.0.0.1
lo: inet6 ::1
Resources:
Processes: 198
Disk usage:
root: 1.71GB
Memory usage:
Memory (current): 301.40MB
Memory (peak): 376.90MB
Network usage:
eth0:
Bytes received: 2.52MB
Bytes sent: 1.12MB
Packets received: 32258
Packets sent: 16224
lo:
Bytes received: 2.81MB
Bytes sent: 2.81MB
Packets received: 18614
Packets sent: 18614
Snapshots:
http-works (taken at 2018/07/24 07:07 UTC) (stateless)
https-works (taken at 2018/07/24 09:59 UTC) (stateless)
snap1 (taken at 2018/08/07 07:37 UTC) (stateless)

The snapshots are listed at the end of the info screen.  This container has three: the most recent being called ‘snap1’.  We can restore our container to that state by issuing:

lxc restore office snap1

…and then we have our container back just where it was before we tried (and hypothetically failed) to update it.   So we could do more investigating to find out what’s breaking and then take corrective action.

The rest of the script is boiler-plate linux updating on Ubuntu, but it’s interactive in that you still have to accept proposed upgrade(s) – we call that “responsible upgrading”.  Finally, each container is restarted so that the correct changes are propagated.  This gives a BRIEF downtime of each container (typically 1-several seconds).  Don’t do this if you cannot afford even a few seconds of downtime.

We run this script manually once a week or so, and it makes the whole container update process less-painful and thus EASIER.

Happy LXC container updating!

Installing OnlyOffice Document Server in an Ubuntu 16.04 LXC Container

In our quest to migrate away from the relentlessly privacy-mining Microsoft products, we have discovered ‘OnlyOffice’ – a very Microsoft-compatible document editing suite.  Onlyoffice have Desktop and server-based versions, including an Open Source self-hosted version, which scratches a LOT of itches for Exploinsights, Inc for NIST-800-171 compliance and data-residency requirements.

If you’ve ever tried to install the open-source self-hosted OnlyOffice document server (e.g. using the official installation instructions here) you may find it’s not as simple as you’d like.  Firstly, per the official instructions, the onlyoffice server needs to be installed on a separate machine.  You can of course use a dedicated server, but we found that for our application, this is a poor use of resources as our usage is relatively low (so why have a physical machine sitting idly for most of the time?).  If you try to install onlyoffice on a machine with other services to try to better utilise your hardware, you can quickly find all kinds of conflicts, as the onlyoffice server uses multiple services to function and things can get messed up very quickly, breaking a LOT of functionality on what could well be a critical asset you were using (before you broke it!).

Clearly, a good compromise is to use a Virtual Machine – and we like those a LOT here at Exploinsights, Inc.  Our preferred form of virtualisation is LXD/LXC because of performance – LXC is blindingly fast, so it minimizes user-experience lag issues.  There is however no official documentation for installing onlyoffice in an lxc container, and although it turns out to be not straightforward, it IS possible – and quite easy once you work through the issues.

This article is to help guide those who want to install onlyoffice document server in an LXC container, running under Ubuntu 16.04.  We have this running on a System76 Lemur Laptop.  The onlyoffice service is resource heavy, so you need a good supply of memory, cpu power and disk space.  We assume you have these covered.  For the record, the base OS we are running our lxc containers in is Ubuntu 16.04 server.

Pre-requisites:

You need a dns name for this service – a subdomain of your main url is fine.  So if you own “mybusiness.com”, a good server name could be “onlyoffice.mybusiness.com”.  Obviously you need dns records to point to the server we are about to create.  Also, your network router or reverse proxy needs to be configured to direct traffic for ports 80 and 443 to your soon-to-be-created onlyoffice server.

Instructions:

Create and launch a container, then enter the container to execute commands:

lxc launch ubuntu:16.04 onlyoffice
lxc exec onlyoffice bash

Now let’s start the installation.  Firstly, a mandatory update (follow any prompts that ask permission to install update(s)):

apt update && apt upgrade && apt autoremove

Then restart the container to make sure all changes take effect:

exit                     #Leave the container
lxc restart onlyoffice   #Restart it
lxc exec onlyoffice bash #Re-enter the container

Now, we must add an entry to the /etc/hosts file (lxc should really do this for us, but it doesn’t, and only office will not work unless we do this):

nano /etc/hosts  #edit the file

Adjust your file to change from something like this:

127.0.0.1 localhost

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

To (see bold entry):

127.0.0.1 localhost
127.0.1.1 onlyoffice.mybusiness.com

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

save and quit.  Now we install postgresql:

apt install postgresql

Now we have to do somethings a little differently than at a regular command line because we operate as a root user in lxc.  So we can create the database using these commands:

su - postgres

Then type:

psql
CREATE DATABASE onlyoffice;
CREATE USER onlyoffice WITH password 'onlyoffice';
GRANT ALL privileges ON DATABASE onlyoffice TO onlyoffice;
\q
exit

We should now have a database created and ready for use.  Now this:

curl -sL https://deb.nodesource.com/setup_6.x | bash -
apt install nodejs
apt install redis-server rabbitmq-server
echo "deb http://download.onlyoffice.com/repo/debian squeeze main" |  tee /etc/apt/sources.list.d/onlyoffice.list
apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys CB2DE8E5
apt update

We are now ready to install the document server.  This is an EXCELLENT TIME to take  a snapshot of the lxc container:

exit
lxc snapshot onlyoffice pre-server-install

This creates a snapshot that we can EASILY restore another day.  And sadly, we probably have to as we have yet to find a way of UPDATING an existing document-server instance, so whenever onlyoffice release an update, we repeat the installation from this point forward after restoring the container configuration.

Let’s continue with the installation:

apt install onlyoffice-documentserver

You will be asked to enter the credentials for the database during the install.  Type the following and press enter:

onlyoffice

Once this is done, if you access your web site (i.e. your version of ‘www.onlyoffice.mybusiness.com’) you should see the following screen:

We now have a document server running, albeit in http mode only.  This is not good enough, we need to use SSL/TLS to make our server safe from eavesdroppers.  There’s a FREE way to do this using the EXCELLENT LetsEncrypt service, and this is how we do that:

Back to the command line in our lxc container.  Edit this file:

nano /etc/nginx/conf.d/onlyoffice-documentserver.conf

Delete everything there and change it to the following (changing your domain name accordingly):

include /etc/nginx/includes/onlyoffice-http.conf;
server {
  listen 0.0.0.0:80;
  listen [::]:80 default_server;
  server_name onlyoffice.mybusiness.com;
  server_tokens off;

  include /etc/nginx/includes/onlyoffice-documentserver-*.conf;

  location ~ /.well-known/acme-challenge {
        root /var/www/onlyoffice/;
        allow all;
  }
}

Save and quit the editor.  Then exeute:

systemctl reload nginx
apt install letsencrypt

And then this, changing the email address and domain name to yours:

letsencrypt certonly --webroot --agree-tos --email youremail@address.com -d onlyoffice.mybusiness.com -w /var/www/onlyoffice/

Now, we have to re-edit the nginx file”

nano /etc/nginx/conf.d/onlyoffice-documentserver.conf

…and replace the contents with the text below, changing all the bold items to your specific credentials:

include /etc/nginx/includes/onlyoffice-http.conf;
## Normal HTTP host
server {
  listen 0.0.0.0:80;
  listen [::]:80 default_server;
  server_name onlyoffice.mybusiness.com;
  server_tokens off;
  ## Redirects all traffic to the HTTPS host
  root /nowhere; ## root doesn't have to be a valid path since we are redirecting
  rewrite ^ https://$host$request_uri? permanent;
}
#HTTP host for internal services
server {
  listen 127.0.0.1:80;
  listen [::1]:80;
  server_name localhost;
  server_tokens off;
  include /etc/nginx/includes/onlyoffice-documentserver-common.conf;
  include /etc/nginx/includes/onlyoffice-documentserver-docservice.conf;
}
## HTTPS host
server {
  listen 0.0.0.0:443 ssl;
  listen [::]:443 ssl default_server;
  server_name onlyoffice.mybusiness.com;
  server_tokens off;
  root /usr/share/nginx/html;
  
  ssl_certificate /etc/letsencrypt/live/onlyoffice.mybusiness.com/fullchain.pem;
  ssl_certificate_key /etc/letsencrypt/live/onlyoffice.mybusiness.com/privkey.pem;

  # modern configuration. tweak to your needs.
  ssl_protocols TLSv1.2;
  ssl_ciphers 'ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256';
  ssl_prefer_server_ciphers on;

  # HSTS (ngx_http_headers_module is required) (15768000 seconds = 6 months)
  add_header Strict-Transport-Security max-age=15768000;

  ssl_session_cache builtin:1000 shared:SSL:10m;
  # add_header X-Frame-Options SAMEORIGIN;
  add_header X-Content-Type-Options nosniff;
 
  # ssl_stapling on;
  # ssl_stapling_verify on;
  # ssl_trusted_certificate /etc/nginx/ssl/stapling.trusted.crt;
  # resolver 208.67.222.222 208.67.222.220 valid=300s; # Can change to your DNS resolver if desired
  # resolver_timeout 10s;
  ## [Optional] Generate a stronger DHE parameter:
  ##   cd /etc/ssl/certs
  ##   sudo openssl dhparam -out dhparam.pem 4096
  ##
  #ssl_dhparam {{SSL_DHPARAM_PATH}};

  location ~ /.well-known/acme-challenge {
     root /var/www/onlyoffice/;
     allow all;
  }
  include /etc/nginx/includes/onlyoffice-documentserver-*.conf;
}

Save the file, then reload nginx:

systemctl reload nginx

Navigate back to your web page onlyoffice.mybusiness.com and you should get the following now:

And if you do indeed see that screen then you now have a fully operational self-hosted OnlyOffice document server.

If you use these instructions, please let us know how it goes.  In a future article, we will show you how to update the container from the snapshot we created earlier.

 

 

 

 

 

Server Backups using LXD

So I am working the process of server backups today.  Most people do backups wrong, and I have been guilty of that too.  You know it’s true when you accidentally delete a file, and you think ‘No worries, I’ll restore it from a backup…’; and about an hour later of opening archives and trying to extract the one file but finding some issue or other…makes you realize your backup strategy sucks.  I am thus trying to do get this right from the get-go today:
LXD makes the process easy (albeit with a few quirks).  EXPLOINSIGHTS Inc. (EI) servers are structured such that each service is running in an LXD container.  Today, there are several active, ‘production’ servers (plus several developmental servers, which are ignored in this posting):

  • Nextcloud – cloud file storage;
  • WordPress – this web-site/blog;
  • Onlyoffice – an ‘OnlyOffice’ document server;
  • Haproxy – the front-end server that routes traffic across the LAN

All of these services are running on one physical device.  They are important to EI as customers access these servers (whether they know it or not), so they need to JUST WORK.
What can I do if the single device (‘server1’) running these services just dies?  well I have battery backup, so a power glitch won’t do it.  Check.  And the modem/router are also UPS charged, so connectivity is good.  Check.  I don’t have RAID on the device but I do have new HD’s – low risk (but not great).  Half-check there.  And if the device hardware just crashes and burns just because it can…well that’s what I want to fix today:
So my way of creating functionally useful backups is to do the following, as a simple loop in a script file:

  1. For each <container name> on server1:
    1. lxc stop <container-name>
    2. lxc copy <container-name> TO server2:<container-name##>
    3. lxc restart <container-name>
  2. Next <container-name>

The ‘##’ at the end of the lxc copy command is the week-number, so I can create weekly container backups EASILY and store them on server2.  I had hoped to do this without stopping the containers, but the criu LXD add-on program (which is supposed to provide that very capability) is not performing properly on server2, so I have a brief server-outage when I run this script for each service for now.  I thus have to try to run this at “quite times”, if such a thing exists; but I can live with that for now.
I did a dry-run today: I executed the script, then I stopped two of the production containers.  I then launched the backup containers with the command:

  • lxc start <container-name##>

I then edited the LAN addresses for these services and I was operational again IN MINUTES.  The only user-experience change I noticed was my login credentials expired, but other than that it was exactly the same experience “as a user”.  Just awesome!
Such a strategy is of no use if you need 100% up-time, but this works for EI for now until I develop something better.  And to be clear, this solution is still far from perfect so it’s always going to be a work in progress:-
Residual risks include:

  1. Both servers are on same premises, so e.g. fire or theft risks are not covered;
    1. Really hard to fix this because of data residency and control requirements.
  2. This strategy requires human intervention to launch the backup servers, so there could be considerable downtime.  Starting a backup lxd container for the haproxy server will also require changes at the router (this one container receives and routes all http and https traffic except ssh/vpn connections.  The LAN router presently sends everything to this server.  A backup container will have a different LAN IP address thus router reconfig is needed);
  3. The cloud file storage container is not small – about 13GB today.  52 weeks of those will have a notable impact on storage at server2 (but storage is cheap);
  4. I still have to routinely check that the backup containers actually WORK (so practice drills are needed);
  5. I have to manually add new production containers to my script – easy to forget;
  6. I don’t like scheduled downtime for the servers…

But overall, today, I am satisfied with this approach.  The backup script will be placed in a cron file for auto-execution weekly.  I may make my script a bit more friendly by sending log files and/or email notification etc., but for now a manual check-up on backup status will suffice.