Our Real-World Response to a Major Power Loss

So we suffered a MAJOR power failure in our office district earlier this week.  A major snowfall came to town, and a few lines and supplies got damaged.  Damaged so badly that, in our case, we lost power.  Not for the usual few seconds or even minutes (which our battery backups can easily handle), but for what turned out to be a whole day.

Our servers, all three of them, went DOWN after the batteries powering them became exhausted.  (We were not present when they went down; we just knew it was coming…)

For a small business, an IT failure like this is a problem.  We don’t have a ‘gaggle’ of IT staff waiting in the wings to take care of issues like this.  And we don’t have a redundant array of back-up servers that can seamlessly kick in, like the big IT giants presumably have.

BUT we did have a plan.  AND it did most of the job.  AND what made it easier for us was our favorite service: LXC.

We routinely make live snapshot backups of our LXC containers.  We have several of them:

  • HAPROXY (our front end traffic handler)
  • Our Primary Nextcloud file sharing and cloud storage (mission-critical)
  • Our VPN server
  • Our OnlyOffice Document server
  • TWO web sites (this being one of them) 

We have backup containers stored across our servers, one of which is located in a facility that uses a different electrical power utility company, deliberately located away from the primary head office.  And THAT is what we brought online.

So what happened?  We learned of the power cut (you can get text messages from most utility companies, but don’t expect them to be quick – so maybe get someone to call you too!).  We spun up our backup LXC containers, then we went to our domain name service web portal and changed the IP address of our HAPROXY server (switched from the Head Office to the alternate location).  We had to wait about an hour for the DNS records to propagate.  Then…our services were back online.
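For anyone who would rather not keep refreshing a browser during that DNS wait, here is a minimal Python sketch of the idea: poll the hostname until it resolves to the backup site's address.  It is an illustration only, not the procedure used during this outage.  The hostname and IP address in the usage note are placeholders, and `wait_for_dns` is a made-up helper name.  It also only reports what the resolver on THIS machine sees, so other users may still be served the old record for a while.

```python
import socket
import time


def resolved_addresses(hostname, resolver=socket.gethostbyname_ex):
    """Return the set of IPv4 addresses the local resolver gives for hostname."""
    try:
        # gethostbyname_ex returns (hostname, aliaslist, ipaddrlist)
        return set(resolver(hostname)[2])
    except OSError:
        # Lookup failed (socket.gaierror and socket.herror are OSError subclasses)
        return set()


def wait_for_dns(hostname, expected_ip, interval=60, max_checks=120,
                 resolver=socket.gethostbyname_ex, sleep=time.sleep):
    """Poll until hostname resolves to expected_ip.

    Returns the number of checks it took, or None if max_checks ran out.
    """
    for attempt in range(1, max_checks + 1):
        if expected_ip in resolved_addresses(hostname, resolver):
            return attempt
        sleep(interval)
    return None
```

Usage would look something like `wait_for_dns("cloud.example.org", "198.51.100.20")`, run from a machine outside the office network.  The `resolver` and `sleep` parameters exist only so the loop can be tried out without real DNS lookups or real waiting.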

How did we do?

We did “OK”.  We did not do “GREAT”.  Embarrassingly, one of our LXC containers had NOT been copied at all due to human error.  It turned out we had an experimental/development LXC container that was named like our primary web site.  So we thought we had a copy of the web-site container, but in fact, it was a container for a new WordPress install…intended to become our primary web server.  We had to scramble there, so we give ourselves 3/10 for that one.  THIS website did work flawlessly, except that the last article posted did not get propagated to the backup server.  We give ourselves 9/10 for that one.

The HAPROXY server worked brilliantly – 10/10.  We did not have OnlyOffice copied over – 0/10 there.  And the big one: the Nextcloud server?  Well, that worked, BUT the URL had to change.  It was the ONLY change.  We did not lose a file, but we really don’t like having to change our URL.  So…we are going to try to devise a solution to that, but we give ourselves 9/10 for that one, since no services or files or sharing services were lost or even delayed.

Oh, and THIS TIME our router worked as it should.  We had our settings correctly saved (a lesson from a prior power failure, but this time we did not learn it the hard way again!)

Overall?  We give ourselves an 8/10, since we had to scramble a little when we realized a web container was NOT what we thought it was.  It’s not mission-critical, but it is ‘mission-desirable’, so we have to deduct some points.

We are completely convinced we could NOT have done this if we did not use the totally AWESOME LXC service.  It is such an incredibly powerful Linux service.  

MAJOR LESSONS AND TAKEAWAYS

  • Power utility companies SUCK at informing customers of power outages.  Don’t expect this to be reliable – we have a plan to be less reliant on them telling us about their outages in the future.
  • DNS propagation takes time.  That will give you some down-time if you have to change an IP address.  We think there’s a workaround here, using MULTIPLE domain names for each service, but all having one URL in common.  That has to be tested, and then implemented…and then we’ll write about it.  🙂
  • Erm, perhaps run LXC container copies in a SCRIPT so that you don’t get confused by poorly named R&D containers.  <BLUSH>
  • An LXC lesson: containers DO NOT ALWAYS AUTOSTART after an unplanned shutdown of the PC.  When they don’t, you have to manually restart them after boot-up.  This is EASY to do, but you have downtime when you are used to them automatically starting after boot-up.  They auto-started on two of our servers, but not on the third.  Go figure…
  • We have to figure out how to use ‘WAKE-ON-LAN’ to reliably avoid requiring a human to press a power button on a server.  More on that to come another day we think…
  • Having a PLAN to deal with a power outage is a BRILLIANT IDEA, because you probably will have a major outage one day.  A plan that involves an alternate power company is a GREAT idea.  We had a plan.  We had an alternate location.  We still had services when the district’s power went down.
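
On the "run the copies in a SCRIPT" lesson above, here is a rough Python sketch of what such a script could look like.  It is not the script used here, just an illustration under some loud assumptions: that the `lxc` command is the LXD client, that the off-site server has been added as a remote called `offsite`, and that the container names in the list are the real ones (they are all made up for this example).  The point is that the list of containers is written down explicitly, and the script refuses to run if any name on it does not exist, instead of quietly copying a similarly named R&D container.

```python
import datetime
import subprocess

# Hypothetical names: replace with the containers and remote actually in use.
CONTAINERS = ["haproxy", "nextcloud", "vpn", "onlyoffice", "web-main", "web-blog"]
BACKUP_REMOTE = "offsite"  # assumed remote name, as added with `lxc remote add`


def missing_containers(wanted, existing):
    """Return the wanted names that are not present on this host."""
    return [name for name in wanted if name not in existing]


def backup_commands(name, remote, stamp):
    """Build the snapshot command and the copy command for one container."""
    snap = f"backup-{stamp}"
    return [
        ["lxc", "snapshot", name, snap],
        # Copy the snapshot (not the running container) to a dated name
        ["lxc", "copy", f"{name}/{snap}", f"{remote}:{name}-{stamp}"],
    ]


def list_local_containers():
    """Ask the lxc client for the local container names, one per line."""
    out = subprocess.run(
        ["lxc", "list", "--format", "csv", "-c", "n"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


def run_backup(containers=CONTAINERS, remote=BACKUP_REMOTE, dry_run=True):
    """Print every command; only execute them when dry_run is False."""
    stamp = datetime.date.today().strftime("%Y%m%d")
    if not dry_run:
        missing = missing_containers(containers, list_local_containers())
        if missing:
            raise SystemExit(f"Not found, refusing to continue: {missing}")
    for name in containers:
        for cmd in backup_commands(name, remote, stamp):
            print(" ".join(cmd))
            if not dry_run:
                subprocess.run(cmd, check=True)
```

Calling `run_backup()` only prints the commands it would run, which is a cheap way to eyeball the container names before trusting it.  `run_backup(dry_run=False)` actually runs them and stops at the first failure.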