Backup Server Activation: This Was Not a Drill #LXC-Hero

Well what a time we had this week.  There we were, minding our own business, running a few standard:

sudo apt update && sudo apt upgrade

…commands as our server notified us of routine (so we thought) Ubuntu updates.  We have done this many many times.  So, what could go wrong?

After this particular update, which included a kernel change, we were given that lovely notice that says “a reboot is required to make changes take effect”.  We never like that.
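For what it’s worth, that notice can also be checked from a script before deciding when to reboot. Below is a minimal sketch, assuming a stock Ubuntu system where the update tooling leaves a flag file at /var/run/reboot-required (plus a .pkgs list next to it). The function name is ours, purely for illustration.

```shell
#!/bin/sh
# Report whether a reboot is pending after an apt upgrade.
# On Ubuntu the flag file is /var/run/reboot-required; the packages that
# asked for the reboot are listed in /var/run/reboot-required.pkgs.

reboot_pending() {
    # $1: flag file to check (defaults to the standard Ubuntu location)
    flag="${1:-/var/run/reboot-required}"
    if [ -f "$flag" ]; then
        echo "reboot required"
        # Show which packages requested the reboot, if the list exists
        if [ -f "$flag.pkgs" ]; then
            sort -u "$flag.pkgs"
        fi
        return 0
    fi
    echo "no reboot required"
    return 1
}

# Usage: reboot_pending && echo "plan the reboot for when someone is on site"
```

Seeing a kernel package in that list is a good hint that the reboot is worth scheduling for a time when someone can physically reach the machine.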

We were out of the office.  But this was a security update, so it’s kinda important.  #AGONISE-A-LITTLE.  So, we went to a backup server first, and performed the same update (it was the same OS and it needed the same patches).  We remotely rebooted the backup server and it worked beautifully.  That made us feel better (#FalseSenseOfSecurity).  So, on our primary server, we issued:

sudo reboot

…at the terminal, as we had done many many times before.  As usual, the SSH connection was terminated without notice.  We don’t like that, but that’s the nature of the beast.  We waited to log in to our Dropbear SSH terminal so we could remotely unlock our encrypted drives.  With some relief, it appeared!  YAY.  We typed in our usual single command and hit the return key:

unlock

We normally get a prompt for our decryption credentials.  In fact, we ALWAYS get a prompt for our decryption credentials.  #NotToday

Not only did we see something new, it was also, as far as we can google, unique for a Dropbear login:

WhiskyTangoFoxtrot (#WTF).  We are not trying to kill a process.  We are trying to unlock our multiple drives.  What is going on?  We logged back in, and got the same result.  This was not a badly typed command.  This was real.  Our primary server was down. And we mean DOWN.  The Kill process is part of the unlock script, which means the script is not working…which means the OS can’t find the primary encrypted drive.  We actually managed to get a remote screen-shot on the terminal, which was even more unnerving (we figured if Dropbear access was broken, maybe we could log in at the console):

Oh that is an UGLY screen.  After about 30 minutes of scrambling (which is too long – #LESSON1), we realised our server was dead until we could physically get back to it.  Every office IT service was down: our Nextcloud server (mission-critical), our office document server (essential for on-the-road work), our two web sites (this being one of them).  NOTHING worked.  Everything is dead and gone.  Including of course this web site and all the prior posts.

This was our first real-world catastrophic failure.  We had trained for this a couple of times, but did not expect to put that practice into effect.

Today was REAL for us.  So, after too long scrambling in vain to fix the primary server (30 minutes of blackout for us and our customers), we 2FA SSH’d into our live backup server (#1 of 2) and reconfigured a few IP addresses.  We had virtually complete BACKUP COPIES of our lxc containers on server#2.  We fired them up, and took a sharp intake of breath…

And it WORKED.  Just as it SHOULD.  But we are so glad it did anyway!  LXC ROCKS. 

Everything was “back to normal” as far as the world would be concerned.  It took maybe 15 minutes (we did not time it…) to get everything running again.   Web sites, office document file server, cloud server etc. – all up and running.  Same web sites, same SSL certs.  Same everything.  This web site is here (duh), as are all of our prior posts.  We lost a few scripts we were working on, and maybe six months off our lives as we scrambled for a bit.

We don’t yet know what happened to our primary server (and won’t for a few days), BUT we think we hedged bets against ourselves in several ways: we are a small business.  So… we use the server hardware for local Desktop work too (it’s a powerful machine, with resources to spare).  We now think that’s a weakness: Ubuntu server edition is simply MORE STABLE than Ubuntu Desktop.  We knew that, but thought we would get away with it.  We were WRONG.  Also, we could have lost a little data because our LXC container backup frequency was low (some of these containers are large, so we copy en masse on a non-daily basis).  We think we got lucky.  We don’t like that.  We think that a single LXC backup strategy is not ideal now either.  We also have all of our backup servers in one geo-location.  We have worried about that, and we do a little more so today.

All of these are lessons learned, which we might actually document in a separate future article.  But today, boy, do we love our LXC containers.

But without a shadow of doubt, the primary takeaway here is: if you operate mission critical IT assets, you could do a lot worse than running your services in LXC containers.  We know of no downside, only upside.  THANK YOU, Canonical.

 

LXC Container Migration – WORKING

So we found a spare hour at a remote location and thought we could tinker a little more with lxc live migration as part of our LXD experiments.


We executed the following in a terminal as NON-ROOT users yet again:

lxc copy Nextcloud EI:Nextcloud-BAK-13-Sep-18

lxc start EI:Nextcloud-BAK-13-Sep-18

lxc list EI: | grep Nextcloud-BAK-13-Sep-18

And we got this at the terminal (a little time later…)

| Nextcloud-BAK-13-Sep-18 | RUNNING | 192.168.1.38 (eth0) | | PERSISTENT | 0 |
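Eyeballing a grep of the table works, but a script can make the same check stricter. Here is a small sketch, assuming an LXD version whose lxc list supports --format csv and the -c ns column selector (name, state). The helper name is made up for this example.

```shell
#!/bin/sh
# Succeed only if the named container is reported as RUNNING.
# Reads "name,STATE" lines on stdin, for example from:
#   lxc list EI: --format csv -c ns
container_running() {
    # $1: exact container name to look for (no partial matches)
    awk -F, -v name="$1" \
        '$1 == name && $2 == "RUNNING" { found = 1 } END { exit !found }'
}

# Usage (EI: is the remote used in the commands above):
#   lxc list EI: --format csv -c ns \
#       | container_running Nextcloud-BAK-13-Sep-18 && echo "backup is up"
```

Matching the whole name field avoids a false positive from some other container whose name merely contains the same text.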

Note that this is a 138GB file.  Not small by any standard.  It holds every single file that’s important to our business (server-side AND end-to-end encrypted of course).  That’s a big file-copy.  So even at LAN speed, this gave us enough time to make some really good coffee!

So we then modified our front-end haproxy server to redirect traffic intended for our primary cloud-file server to this lxc instance instead.  (Two minor changes to a config, replacing the IP address of the current cloud with that of the new cloud.)  Then we restarted our proxy server and….sharp intake of breath…
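The config change itself is easy to script. What follows is only a sketch: the old address (192.168.1.3) and the config path are placeholders we invented for illustration, while 192.168.1.38 is the address the backup container reported above.

```shell
#!/bin/sh
# Replace one backend IP with another in an haproxy config, keeping a backup.
swap_backend_ip() {
    # $1: config file, $2: old IP, $3: new IP
    cfg="$1"
    # Escape the dots so they only match literal dots
    old_re="$(printf '%s' "$2" | sed 's/\./\\./g')"
    cp "$cfg" "$cfg.bak"
    # Only replace the old IP when it is not followed by another digit,
    # so 192.168.1.3 does not clobber 192.168.1.30
    sed -e "s/$old_re\\([^0-9]\\)/$3\\1/g" \
        -e "s/$old_re\$/$3/" "$cfg.bak" > "$cfg"
}

# Usage (path and old IP are assumptions):
#   swap_backend_ip /etc/haproxy/haproxy.cfg 192.168.1.3 192.168.1.38
#   haproxy -c -f /etc/haproxy/haproxy.cfg && sudo systemctl restart haproxy
```

Running haproxy -c before the restart checks the edited file for syntax errors, and the .bak copy gives a quick way back to the primary server once it returns.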

IT WORKED BEAUTIFULLY!

Almost unbelievably, our entire public-facing cloud server was now running on another machine (just a few feet away as it happens).   We hoped for this, but we really did not expect a 138GB file to copy and startup first time.  #WOW

We need to test and work this instance to death to make sure it’s every bit as SOUND as our primary server, which is now back online while this backup version just sleeps.

Note that this is a complete working copy of our entire cloud infrastructure – the Nextcloud software, every single file, all the HTTPS: certs, databases, configurations, OS – everything.  A user changes NOTHING to access this site, in fact, it’s not even possible for them to know it’s any different.

We think this is amazing, and is a great reflection of the abilities of lxc, which is why we are such big fans.

With this set-up, we could create working copies of our servers in another geo-location every, say, month, or maybe even every week (once a day is too much for a geo-remote facility – 138GB for this one server over the internet?  Yikes).

So yes, bandwidth needed IS significant, and thus you can’t flash the larger server images over the internet every day, but it does provide for a very resistant disaster-recovery situation: if our premises go up in a Tornado, we can be back online with just a few clicks from a web-browser (change DNS settings and maybe a router setting or two) and issue a few commands from an ssh terminal, connected to the backup facility.
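To put a rough number on “Yikes”, here is a back-of-envelope sketch. The 20 Mbit/s uplink is purely an assumption for illustration (we have not stated our real upload speed), and it ignores protocol overhead and compression.

```shell
#!/bin/sh
# Estimate hours needed to push a container of a given size over a link.
transfer_hours() {
    # $1: size in GB (decimal), $2: sustained speed in Mbit/s
    # 1 GB = 8000 Mbit, so seconds = GB * 8000 / Mbit/s
    awk -v gb="$1" -v mbps="$2" \
        'BEGIN { printf "%.1f\n", gb * 8000 / mbps / 3600 }'
}

transfer_hours 138 20     # assumed 20 Mbit/s uplink: prints 15.3
transfer_hours 138 1000   # gigabit LAN, best case:   prints 0.3
```

At that assumed rate a single copy ties up the uplink for most of a day, which is why weekly or monthly looks more realistic than daily.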

We will develop a proper, sensible strategy for using this technique after we have tested it extensively, but for now, we are happy it works.  It gives us another level of redundancy for our updating and backup processes.

GOTTA LOVE LXD


An LXC Experiment for Live Backups

Don’t we all love our backups?  We all have them.  Some of us have backups done poorly, and some of us worry that ours is still not as good as we would like.  Few have it nailed.  We don’t have it nailed…

Here at EXPLOINSIGHTS, Inc. we think we are in the second camp (“not as good as we would like it to be”).  We have a ton of backups of our data, much of it offline (totally safe from e.g. malware), and some of those are in different locations (protected against theft, fire, flood, mayhem AND hackers), but they would all require a lot of work to get going if we ever needed them.  So, if we suffer a disaster (office burned to ground or swallowed up by a Tornado or house stolen by Hillary Clinton’s Russian Hackers), then rebuilding our system would still take time.  What we ALL want, but can seldom get, is a live backup that runs in parallel to our existing system.  Like a hidden ghost system that mirrors every change we make, without being exposed to the hazards of users and such.

So hold that thought…

Onto our favourite Linux Ubuntu capability: LXC.  LXC has a capability of basically exporting (copying or moving) containers from one machine to another – LIVE.  This means you don’t have to stop a container to take a backup copy of it and place it on ANOTHER MACHINE.  Theoretically, this is similar to taking a copy of a stopped container but without the drag of stopping it.

We know it has to be pretty complicated under the hood for this to work, and it’s evident it’s not really intended for a production environment, but we are going to play with this a little to see if we can use live migration to give us full working copies of our system servers on another machine.  And if we can, to place that machine not on the LAN, but on the WAN.

Our largest container is our Nextcloud instance.  We have multiple users, with all kinds of end-to-end encrypted AND simultaneously server-side encrypted data in multiple accounts.  All stored in one CONVENIENT container.  We are confident it’s SECURE from theft – we have tried to hack it.  All the useful data are encrypted.  But the container is growing.  Today it stands at about 138 GB.  Stopping that container and copying it even over the LAN is a slow process.  And that container is DOWN at that time.  If a user (or worse, a customer), tries to access the container to get a file, all they see is “server is down” in their browser.  #NotUseful

So for this reason, we don’t like to “copy” our containers – we hate the downtime risk.

So….we are going to play with live copying.  We have installed criu on two of our servers, and we are doing LAN-based experiments.  It’ll take time, as we have to copy-and-test.  Copy-and-test.  We have to make sure all accounts can be accessed AND that all data (double encrypted at rest, triple if you count the full-disk-encryption; quadruple if you count the https:// transport mechanism for copying) can be accessed without one byte of corruption.
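One way to approach the “not one byte of corruption” part of copy-and-test is to compare checksums of the data on both sides. This is a sketch of the idea only, not our actual test procedure: it assumes the two data trees can be reached as ordinary directories (for example by running it inside each container), and the function names are invented.

```shell
#!/bin/sh
# Print a sorted "checksum  ./relative/path" manifest for a directory tree.
manifest() {
    ( cd "$1" && find . -type f -exec sha256sum {} + | sort -k 2 )
}

# Succeed only if both trees hold the same paths with the same bytes.
trees_match() {
    a="$(manifest "$1")" || return 2
    b="$(manifest "$2")" || return 2
    [ "$a" = "$b" ]
}

# Usage (paths are placeholders):
#   trees_match /srv/original-data /srv/copied-data && echo "copies agree"
```

Because the files are encrypted at rest, the checksums are taken over ciphertext, which is fine for a byte-for-byte comparison; whether the accounts can still decrypt their data has to be tested separately.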

Let the FUN begin.   We have written this short article as the first trial is underway.  We have our 138GB container on our “OB1” host (a Star Wars throwback):

See the last entry?  We are copying the RUNNING container to our second server (boringly called ‘exploinsights’).  It’s a big file, even for our super fast LAN router, and it’s not there even now:

The image has not yet appeared, but we have confirmed it’s still up ‘n running on the OB1 host.

Lots of testing to do, and clearly files this large can’t be backed up easily over the internet, so this is definitely an “as-well-as” non-routine option for a machine-to-machine back up, but we like the concept of this, so we are spending calories exploring it.

#StayTuned for an update.  Also, please let us know what you think of this – drop us an email at:

Admininstration@EXPLOINSIGHTS.COM or

ARWDCS@gmail.com

UPDATE:

I need a good Plan B: The copy failed after a delay of about two hours.  It should not take that long to copy, so something is broken, and we are not going to try to dig our way out of that hole.  LXC live migration died on the day we tried it.  #RIP

Server Backups using LXD

So I am working on the process of server backups today.  Most people do backups wrong, and I have been guilty of that too.  You know it’s true when you accidentally delete a file, and you think ‘No worries, I’ll restore it from a backup…’; and about an hour later, after opening archives and trying to extract the one file but finding some issue or other…you realize your backup strategy sucks.  I am thus trying to get this right from the get-go today:
LXD makes the process easy (albeit with a few quirks).  EXPLOINSIGHTS Inc. (EI) servers are structured such that each service is running in an LXD container.  Today, there are several active, ‘production’ servers (plus several developmental servers, which are ignored in this posting):

  • Nextcloud – cloud file storage;
  • WordPress – this web-site/blog;
  • Onlyoffice – an ‘OnlyOffice’ document server;
  • Haproxy – the front-end server that routes traffic across the LAN

All of these services are running on one physical device.  They are important to EI as customers access these servers (whether they know it or not), so they need to JUST WORK.
What can I do if the single device (‘server1’) running these services just dies?  Well, I have battery backup, so a power glitch won’t do it.  Check.  And the modem/router are also UPS charged, so connectivity is good.  Check.  I don’t have RAID on the device but I do have new HDs – low risk (but not great).  Half-check there.  And if the device hardware just crashes and burns just because it can…well that’s what I want to fix today:
So my way of creating functionally useful backups is to do the following, as a simple loop in a script file:

  1. For each <container name> on server1:
    1. lxc stop <container-name>
    2. lxc copy <container-name> server2:<container-name##>
    3. lxc start <container-name>
  2. Next <container-name>

The ‘##’ at the end of the lxc copy command is the week-number, so I can create weekly container backups EASILY and store them on server2.  I had hoped to do this without stopping the containers, but the criu LXD add-on program (which is supposed to provide that very capability) is not performing properly on server2, so I have a brief server-outage when I run this script for each service for now.  I thus have to try to run this at “quiet times”, if such a thing exists; but I can live with that for now.
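The loop above could be sketched in shell roughly as follows. This is not my actual script: the DRY_RUN switch and the function names are additions for illustration, the container names in the usage line are placeholders, and it assumes server2 is already configured as an LXD remote.

```shell
#!/bin/sh
# Run a command, or just print it when DRY_RUN=1 (handy for rehearsals).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Stop, copy and restart one container.
# $1: container name, $2: week number appended to the copy's name.
backup_container() {
    run lxc stop "$1" || return 1
    run lxc copy "$1" "server2:$1$2"
    status=$?
    # Bring the live container back up even if the copy failed
    run lxc start "$1"
    return "$status"
}

# Back up every container named on the command line, tagged with the ISO week.
backup_all() {
    week="$(date +%V)"
    for name in "$@"; do
        backup_container "$name" "$week" || echo "backup of $name failed" >&2
    done
}

# Usage: backup_all Nextcloud WordPress Onlyoffice Haproxy
```

Restarting the container regardless of the copy result keeps the outage as short as possible. Note that week numbers repeat every year, so last year’s copy with the same name would have to be removed or renamed before the copy can succeed.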
I did a dry-run today: I executed the script, then I stopped two of the production containers.  I then launched the backup containers with the command:

  • lxc start <container-name##>

I then edited the LAN addresses for these services and I was operational again IN MINUTES.  The only user-experience change I noticed was my login credentials expired, but other than that it was exactly the same experience “as a user”.  Just awesome!
Such a strategy is of no use if you need 100% up-time, but this works for EI for now until I develop something better.  And to be clear, this solution is still far from perfect so it’s always going to be a work in progress:-
Residual risks include:

  1. Both servers are on same premises, so e.g. fire or theft risks are not covered;
    1. Really hard to fix this because of data residency and control requirements.
  2. This strategy requires human intervention to launch the backup servers, so there could be considerable downtime.  Starting a backup lxd container for the haproxy server will also require changes at the router (this one container receives and routes all http and https traffic except ssh/vpn connections.  The LAN router presently sends everything to this server.  A backup container will have a different LAN IP address thus router reconfig is needed);
  3. The cloud file storage container is not small – about 13GB today.  52 weeks of those will have a notable impact on storage at server2 (but storage is cheap);
  4. I still have to routinely check that the backup containers actually WORK (so practice drills are needed);
  5. I have to manually add new production containers to my script – easy to forget;
  6. I don’t like scheduled downtime for the servers…

But overall, today, I am satisfied with this approach.  The backup script will be placed in a cron file for auto-execution weekly.  I may make my script a bit more friendly by sending log files and/or email notification etc., but for now a manual check-up on backup status will suffice.

Ransomware

Another day, another government agency hit by ransomware:
Here
We need some serious effort to take down those behind these attacks; crypto currency does not help, as bad guys hide behind anonymous payments.  I am also left wondering how long before I get hacked AND whether my backup strategy will work.  That said, my backups are NOT CONNECTED TO THE INTERNET, so I at least have a fighting chance.
I back up whenever I am in my office, onto separate drives that are not internet connected, so ransomware cannot easily affect them.  That doesn’t help my systems, which can be rebuilt, but my data at least are relatively safe.
My Nextcloud instance provides some protection too, as file versioning means my changed files are always retained even if changed by malware.
Good luck to those who have to worry about this stuff.  #MeToo!
#Offline is sometimes the only way

Server Updates

Updating critical software is something not to be taken lightly.  It’s nerve-wracking when your business operations rely upon such systems.
What has helped EXPLOINSIGHTS Inc. (EI) sleep better at night is the extensive use of unprivileged containers or so-called virtual machines.  Most of the EI support software is installed in unprivileged LXC containers, which are a standard component of the Ubuntu 16.04 Linux distribution.
Today was a typical day for EI: an update to a major release of Nextcloud.  This critical software houses EI’s data and is the hub for data sharing with customers and stakeholders.  If this upgrade goes wrong, my customers can’t download their files.  #Embarrassing – or maybe even worse; loss of critical data?  To make it more enjoyable, I am not in the office today – so the update has to be performed remotely via secure SSH.  That’s an excellent recipe for high stress…normally.
So how did EI mitigate this update risk? With the following simple command entered at the host machine terminal via secure SSH access (i.e. WITHOUT SuperUser privileges!):
lxc snapshot NC pre-13-upgrade
That’s it.  Painless.  Super-safe (no Superuser rights!).  Blindingly fast.  Very efficient.  And this creates a full working snapshot of the EI current cloud configuration – files, links, settings, SSL-certs, SQL database, apache2 configs – absolutely everything needed to completely restore the setup should the upgrade process break something critical.
Breaking this command down:

  • lxc – this is the LXD command-line client we use to manage our containers, followed by three parameters:
    • snapshot – tells LXD to take a full working snapshot of the running instance;
    • NC – the name of the EI container that runs the Nextcloud instance – the one we want to back up;
    • pre-13-upgrade – a name assigned to the snapshot (easy to remember).

Yes, it’s that simple.  After that, the Nextcloud upgrade process was initiated…and as it happens, everything went smoothly, so the snapshot was NOT actually needed to recover the pre-version-13 upgrade – but it will be kept for a while just to make sure there are no bugs waiting in the shadows.  Here’s the new EI cloud instance:

EI cloud software – UPDATED to latest version

If a major problem arises, then the following command entered at the same terminal, again as a non-SuperUser, restores the entire pre-version 13 instance:
lxc restore NC pre-13-upgrade
#NoWorries 🙂
This restore command overwrites the current instance with the pre-upgraded and fully functioning snapshot.  The only risk is losing files/links created since completing the upgrade process – way better than a total rebuild.
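For the cautious, the snapshot, upgrade and restore steps can be wrapped into one helper so the rollback happens automatically if the upgrade command reports failure. This is a sketch under assumptions: the wrapper name and the upgrade script in the usage line are hypothetical, and it only catches failures that show up as a non-zero exit status.

```shell
#!/bin/sh
# Snapshot a container, run an upgrade command, and roll back if it fails.
# $1: container, $2: snapshot name, remaining arguments: the upgrade command.
upgrade_with_snapshot() {
    container="$1"
    snap="$2"
    shift 2
    # Refuse to upgrade if the safety snapshot could not be taken
    lxc snapshot "$container" "$snap" || return 1
    if "$@"; then
        echo "upgrade ok; snapshot $snap kept for later rollback"
    else
        echo "upgrade failed; restoring $snap" >&2
        lxc restore "$container" "$snap"
        return 1
    fi
}

# Usage (the upgrade script is a placeholder):
#   upgrade_with_snapshot NC pre-13-upgrade ./run-nextcloud-upgrade.sh
```

A bug that only surfaces days later still needs the manual lxc restore shown above, which is why the snapshot is kept either way.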
LXC makes it so convenient to update major platforms.  The entire process was fast, safe and easy.  And because all the work is performed in non-SuperUser unprivileged mode, it comes with the confidence of knowing you can’t accidentally break an important part of the core system on the way.  It’s so good, it’s almost boring – but only almost.
Check out using LXC to run your small business support software; it’s better than prescription-grade sleeping tablets for helping you with the upgrade process!  Official documents are here.  And there’s a ton of useful tutorials to get you started – Google is your friend.