Nextcloud 15 – Evaluation Update

*** FURTHER UPDATE: 27-DEC-18
We continued operating the Nextcloud 15 instances in development mode. We are convinced it still has some operational bugs, but as far as we can tell they do not impact stability or security (they are ‘annoying’ because they are pervasive). The instances themselves are stable and operating without significant issue. We have continued with the auto-syncing instances plan on these development servers, and that has performed well. They act as one server: you change a file in one instance, and within a few minutes all servers have the change. It’s not terribly useful yet because it does not sync file-links and such, but it’s a reasonable back-up option if one of the servers loses power. We are going to upgrade this development set-up to Production status and begin retiring our Nextcloud 13 servers. This could be an interesting phase, since the version 13 servers have performed well (but are approaching the end of their update life).
*** END OF UPDATE

As much as we hate to say this: Nextcloud 15 is, SO FAR, a poor replacement for Nextcloud 13. There are just too many bugs to make it worth our time to evaluate right now. We have reported some of the bugs, but not all of them, because many are not reliably reproducible. Maybe we were too quick off the mark, but one thought struck us: would we have fielded our first Nextcloud a year ago if we had seen this many errors and strange results in THAT version? #NotSure. We are very glad we have Nextcloud version 13 running – it has proven to be rock-solid for us.

We will wait for the next stable release of Nextcloud version 15 before we even TRY to evaluate this updated version further. Hopefully it will be a lot more consistent and error-free.

We still like our plan to set up several auto-synchronizing Nextcloud ‘nodes’, but we have abandoned our plans to look at using Nextcloud 15 for this project, so it goes on-hold for a while.

Nextcloud Release 15

Well we got on the Nextcloud 15 release quickly and have created our first networked Nextcloud server instances. These are still a development effort – not yet good enough for Production use, but maybe after a patch or two they will be.

Our NEW configuration involves several simultaneously networked Nextcloud installs, so we have built-in server-independent redundancy. If any one instance goes down, the others are ready to take up the slack.

This is the first time we have tried a configuration like this, and it’s really just a step on the journey for a more robust system that has better Operational Continuity.

We have servers that will, once the configuration is complete, be operated in different geographical locations to protect against a power outage or an even more catastrophic event at any one location (think meteorite hitting the office…). And we have tried to make this invisible to clients: we have configured EACH server with essentially two SSL certificates – one unique, server-specific certificate, and one certificate shared by all servers:

server 1: www.server1.com and cloud.server.com
server 2: www.server2.com and cloud.server.com
server 3: www.server3.com and cloud.server.com
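
For illustration only (this is not our exact procedure), obtaining the two certificates on, say, server 1 with Let’s Encrypt’s certbot might look something like the lines below – noting that the shared cloud.server.com certificate can only be validated over HTTP while DNS actually points that name at this particular box (or you use DNS-based validation instead):

sudo certbot --apache -d www.server1.com
sudo certbot --apache -d cloud.server.com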

The servers are sync’d via our sftp installations, which are the heart of our cloud file access system. A file can be changed either via the web portal or via an sftp connection, and the changes are propagated quickly to each of the networked servers. This gives each server a copy of the files – including the revised versions. That in itself provides a back-up capability, but that’s not why we did this:

The primary cloud.server.com IP address is configured at our DNS provider (we use Google’s service), and if the currently live ‘cloud.server.com’ site goes down (power cut, malfunction, theft, fire or famine…etc.), we can update the DNS record and point the name at the next server. This takes a little time (it’s a manual change at the Google DNS console for now), but it allows us to hand out a file link on cloud.server.com that will work against any of servers 1, 2 or 3, transparently to the user. This is still a bit fiddly and we know we need to do more, but it’s a start.
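
When we do flip the record, a simple way to watch the change take effect (cloud.server.com below is the same placeholder name used above) is to query the name repeatedly, both via the local resolver and directly against a public one:

dig +short cloud.server.com
dig +short cloud.server.com @8.8.8.8

Once both return the new server’s IP address, the failover is effectively complete for anyone resolving the name fresh.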

Ultimately we want to automate this a lot more (having to change a DNS record requires human intervention – an automatic weak link). But for now, it should give us somewhat improved operational continuity when we get our next major outage at one of the server locations.

Nextcloud’s server installation probing seems happy with our configuration:

This was welcome, as the installation process has changed from prior versions.

And SSL Labs is OK with our HTTPS configuration too (and yes, we note that TLS 1.1 is still enabled – we have not yet pulled the trigger on that one, but we are likely to once we are sure it won’t impact our customers):

We have a lot of work to do before we can launch this new configuration for Production, but hopefully this won’t take too long.

Nextcloud 15 has a lot of life in it, so hopefully this will give us time to strengthen our Operational Continuity further. But this is not a bad start. 🙂

Our Real-World Response to a Major Power Loss

So we suffered a MAJOR power failure in our office district earlier this week.  A major snowfall came to town, and a few lines and supplies got damaged.  Damaged so badly, that in our case we lost power.  Not for the usual few seconds or even minutes (which our battery backups can easily handle), but for what turned out to be a whole day.

Our servers, all three of them, went DOWN after the batteries powering them became exhausted.  (We were not present when they went down, we just knew it was coming…).

For a small business, an IT failure like this is a problem. We don’t have a ‘gaggle’ of IT staff waiting in the wings to take care of issues like this, and we don’t have a redundant array of back-up servers that can seamlessly kick in, as the big IT giants presumably do.

BUT we did have a plan. AND it did (most of) the job. AND what made it easier for us was our favorite service: LXC.

We routinely make live snapshot backups of our LXC containers.  We have several of them:

  • HAPROXY (our front end traffic handler)
  • Our Primary Nextcloud file sharing and cloud storage (mission-critical)
  • Our VPN server
  • Our OnlyOffice Document server
  • TWO web sites (this being one of them) 

We have backup containers stored across our servers, one of which is located in a facility that uses a different electrical power utility company, deliberately sited away from the primary head office. And THAT is what we brought online. So what happened? We learned of the power cut (you can get text messages from most utility companies, but don’t expect them to be quick – so maybe get someone to call you too!). We spun up our backup LXC containers, then we went to our domain name service web portal and changed the IP address of our HAPROXY server (switching from the head office to the alternate location). We had to wait about an hour for the DNS records to propagate. Then…our services were back online.

How did we do?

We did “OK”. We did not do “GREAT”. Embarrassingly, one of our LXC containers had NOT been copied at all, due to human error. It turned out we had an experimental/development LXC container named as our primary web site. So we thought we had a copy of the web-site container, but in fact it was a container for a new WordPress install…intended to become our primary web server. We had to scramble there, so we give ourselves 3/10 for that one. We also had THIS website, which worked flawlessly except that the last article posted had not been propagated to the backup server. We give ourselves 9/10 for that one.

The HAPROXY server worked brilliantly – 10/10. We did not have OnlyOffice copied over – 0/10 there. And the big one: the Nextcloud server? Well, that worked, BUT the url had to change. It was the ONLY change. We did not lose a file, but we really don’t like having to change our url. So…we are going to try to devise a solution to that, but we give ourselves 9/10 for that one, since no services, files or sharing links were lost or even delayed.

Oh, and THIS TIME our router worked as it should.  We had our settings correctly saved (a lesson from a prior power failure, but this time we did not learn it the hard way again!)

Overall? We give ourselves an 8/10, as we had to scramble a little when we realized a web container was NOT what we thought it was. It’s not mission-critical, but it is ‘mission-desirable’, so we have to deduct some points.

We are completely convinced we could NOT have done this if we did not use the totally AWESOME LXC service.  It is such an incredibly powerful Linux service.  

MAJOR LESSONS AND TAKEAWAYS

  • Power utility companies SUCK at informing customers of power outages.  Don’t expect this to be reliable – we have a plan to be less reliant on them telling us about their outages in the future.
  • DNS propagation takes time.  That will give you some down-time if you have to change an IP address.  We think there’s a workaround here, using MULTIPLE domain names for each service, but all having one URL in common.  That has to be tested, and then implemented…and then we’ll write about it.  🙂
  • Erm, perhaps run your LXC container copies from a SCRIPT so that you don’t get confused by poorly named R&D containers (see the sketch after this list).  <BLUSH>
  • An LXC lesson: containers DO NOT ALWAYS AUTOSTART after an unplanned shutdown of the PC.  You may have to manually restart them after boot-up.  This is EASY to do, but it means downtime when you are used to them starting automatically after boot-up.  They auto-started on two of our servers, but not on the third.  Go figure…
  • We have to figure out how to use ‘WAKE-ON-LAN’ to reliably avoid requiring a human to press a power button on a server.  More on that to come another day we think…
  • Having a PLAN to deal with a power outage is a BRILLIANT IDEA, because you probably will have a major outage one day.  A plan that involves an alternate power-company is a GREAT idea.  We had a plan.  We had an alternate location.  We still had services when the districts power went down.
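
For what it’s worth, here is the kind of minimal script we have in mind for the container-copy lesson above – a sketch only, with made-up container and remote names (adjust to taste):

#!/bin/bash
# Copy a FIXED list of containers to a backup remote, so a mis-named
# R&D container can never sneak into (or out of) the backup set.
set -e
REMOTE="backup"              # assumes 'lxc remote add backup <address>' was done already
STAMP=$(date +%d-%b-%y)      # e.g. 27-Dec-18
for C in haproxy nextcloud vpn onlyoffice website1 website2; do
    lxc snapshot "$C" "bak-$STAMP"
    lxc copy "$C/bak-$STAMP" "$REMOTE:$C-BAK-$STAMP"
done
# Worth considering too, per the autostart lesson above:
# lxc config set <container> boot.autostart true

Keeping the list explicit (rather than copying ‘whatever is there’) is the point – it is what stops a stray R&D container from silently standing in for the real one.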

TLS 1.3 has rolled out for Apache2

The best cryptography in HTTPS got a helping hand for wider adoption, as Apache2 now incorporates TLS 1.3 support.

We have updated Apache2 in some of our LXC-based servers already, and the rest will be completed soon enough.  Apache version 2.4.37 gives us this upgraded TLS support.  #AWESOME.
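
If you want to confirm which Apache and OpenSSL you are actually running before expecting TLS 1.3, a quick check on Ubuntu is:

apache2 -v
openssl version

Anything at Apache 2.4.37 or newer, built against OpenSSL 1.1.1 or newer, is in the TLS 1.3 club.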

And this upgrade is non-trivial.  TLS 1.3 is very robust at deterring eavesdropping on your connections, even by a determined attacker.   This is another step towards improving the security of the internet, and we welcome and support it.   TLS 1.3 is also FASTER, which is a welcome side-effect.

As part of our server-side Apache upgrade, this site now offers TLS 1.3 to your browser during the https handshake.  And it works too, as shown in a snapshot of this site/post from one of our Android mobile devices:

“The connection uses TLS 1.3.” 👍
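
You can also check from a terminal, provided your own OpenSSL is 1.1.1 or newer (substitute your site’s hostname for the placeholder):

openssl s_client -connect www.example.com:443 -tls1_3

If the handshake succeeds, the output reports a TLS 1.3 cipher suite; if the server (or your local OpenSSL) cannot do TLS 1.3, the connection attempt simply fails.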

We are now contemplating disabling the cryptographically weaker TLS 1.1 connections to our sites, which might block some users who still run old browsers, but it will make the connections more secure.   We are coming round to the view that the security benefit of blocking TLS 1.1 outweighs the inconvenience it may cause some customers – the alternative is accepting extra risk of malware/cyberattacks on what might be OUR data.  We encourage EVERYONE who visits this site to use modern, up-to-date browsers like Google Chrome, Firefox etc.  We’ll post an update when we make the decision to actively block TLS 1.1, but if you use a really old browser you might never read it, because this site too will drop support for TLS 1.1 once we roll out that policy.  🙂
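
When we do pull that trigger, the Apache side of it should amount to a one-line change in the SSL configuration – roughly the following (a sketch, not our live config), followed by a config test and a reload:

SSLProtocol -all +TLSv1.2 +TLSv1.3

sudo apache2ctl configtest
sudo systemctl reload apache2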

For the curious, we recommend you visit the excellent SSL-Labs site to test your servers (if any), the sites you visit and your actual web-browser.  Happy browsing!

Backup Server Activation: This Was Not a Drill #LXC-Hero

Well what a time we had this week.  There we were, minding our own business, running a few standard:

sudo apt update && sudo apt upgrade

…commands as our server notified us of routine (so we thought) Ubuntu updates.  We have done this many many times.  So, what could go wrong?

After this particular update, which included a kernel change, we were given that lovely notice that says “a reboot is required to make changes take effect”.  We never like that.

We were out of the office.  But this was a security update, so it’s kinda important.  #AGONISE-A-LITTLE.  So, we went to a backup server first, and performed the same update (it was the same OS and it needed the same patches).  We remotely rebooted the backup server and it worked beautifully.  That made us feel better (#FalseSenseOfSecurity).  So, on our primary server, we issued:

sudo reboot

…at the terminal, as we had done many times before.  As usual, the SSH connection was terminated without notice.  We don’t like that, but that’s the nature of the beast.  We waited to log in to our Dropbear SSH terminal so we could remotely unlock our encrypted drives.  With some relief, it appeared!  YAY.  We typed in our usual single command and hit the return key:

unlock

We normally get a prompt for our decryption credentials.  In fact, we ALWAYS get a prompt for our decryption credentials.  #NotToday

Not only did we see something new, it was also, as far as we can google, unique for a Dropbear login:

WhiskyTangoFoxtrot (#WTF).  We are not trying to kill a process.  We are trying to unlock our multiple drives.  What is going on?  We logged back in and got the same result.  This was not a badly typed command.  This was real.  Our primary server was down.  And we mean DOWN.  The kill-process prompt is part of the unlock script, which means the script is not working…which means the OS can’t find the primary encrypted drive.  We actually managed to get a remote screen-shot of the console, which was even more unnerving (we figured that if Dropbear access was broken, maybe we could log in at the console):

Oh, that is an UGLY screen.  After about 30 minutes of scrambling (which is too long – #LESSON1), we realised our server was dead until we could physically get back to it.  Every office IT service was down: our Nextcloud server (mission-critical), our office document server (essential for on-the-road work), and our two web sites (this being one of them).  NOTHING worked.  Everything was dead and gone – including, of course, this web site and all the prior posts.

This was our first real-world catastrophic failure.  We had trained for this a couple of times, but did not expect to put that practice into effect.

Today was REAL for us.  So, after too long scrambling in vain to fix the primary server (30 minutes of blackout for us and our customers), we 2FA SSH’d into our live backup server (#1 of 2) and reconfigured a few IP addresses.  We had virtually complete BACKUP COPIES of our lxc containers on server#2.  We fired them up, and took a sharp intake of breath…

And it WORKED.  Just as it SHOULD.  But we are so glad it did anyway!  LXC ROCKS. 

Everything was “back to normal” as far as the world was concerned.  It took maybe 15 minutes (we did not time it…) to get everything running again.   Web sites, office document server, cloud file server etc. – all up and running.  Same web sites, same SSL certs.  Same everything.  This web site is here (duh), as are all of our prior posts.  We lost a few scripts we were working on, and maybe six months off our lives as we scrambled for a bit.
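
For anyone curious, ‘fired them up’ is nothing more exotic than a handful of lxc commands on the backup host – something like the lines below, with illustrative container names rather than our real ones (the lxc list first is just to confirm what actually made it across):

lxc list

lxc start haproxy nextcloud onlyoffice website1 website2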

We don’t yet know what happened to our primary server (and won’t for a few days), BUT we think we stacked the odds against ourselves in several ways.  We are a small business, so we use the server hardware for local desktop work too (it’s a powerful machine, with resources to spare).  We now think that’s a weakness: Ubuntu Server edition is simply MORE STABLE than Ubuntu Desktop.  We knew that, but thought we would get away with it.  We were WRONG.  Also, we could have lost a little data because our LXC container backup frequency was low (some of these containers are large, so we copy them en masse on a non-daily basis).  We think we got lucky.  We don’t like that, and we no longer think that backup strategy is ideal either.  We also have all of our backup servers in one geo-location.  We have worried about that before, and we worry about it a little more today.

All of these constitute lessons learned, which we might document in a separate future article.  But today, boy, do we love our LXC containers.

But without a shadow of doubt, the primary takeaway here is: if you operate mission critical IT assets, you could do a lot worse than running your services in LXC containers.  We know of no downside, only upside.  THANK YOU, Canonical.

 

Encrypting and auto-boot-decryption of an LXC zpool on Ubuntu with LUKS


So we have seen some postings online suggesting that you can’t encrypt an lxd zpool, such as this GitHub posting here, which correctly explains that an encrypted zpool that doesn’t mount at startup disappears WITHOUT WARNING from your lxd configuration.

That’s not the whole picture, though: it IS possible to encrypt an lxd zpool with luks (the standard full disk encryption option for Ubuntu Linux) and have it work out-of-the-box at startup – it’s just perhaps not as straightforward as everyone would like.

WARNING WARNING – THE INSTRUCTIONS BELOW ARE NOT GUARANTEED.  WE USE COMMANDS THAT WILL WIPE A DRIVE SO GREAT CARE IS NEEDED AND WE CANNOT HELP YOU IF YOU LOSE ACCESS TO YOUR DATA.  DO NOT TRY THIS ON A PRODUCTION SERVER.  SEEK PROFESSIONAL HELP INSTEAD, PLEASE!

With that said…this post is for those who, for example, have a new, clean system that they can always do over if this tutorial does not work as advertised.  The Ubuntu OS changes over time, so the instructions might not work on your particular system.

Firstly, we assume you have Ubuntu 16.04 installed on a luks-encrypted drive (i.e. the standard Ubuntu install using the “encrypt HD” option).  This of course requires you to enter a password at boot-up to decrypt your system.

We assume you have a second drive that you want to use for your linux lxd containers.  That’s how we roll our lxd.

So, to set up an encrypted zpool, select the drive to be used (we assume it’s /dev/sdd here, and we assume it’s a newly created partition that is not yet formatted – your drive might be /dev/sda, /dev/sdb or something quite different – MAKE SURE YOU GET THAT RIGHT).

Go through the normal luks procedure to encrypt the drive:

sudo cryptsetup -y -v luksFormat /dev/sdd

Enter the password and NOTE THE WARNING – this WILL destroy the drive contents.  #YOUHAVEBEENWARNED

Then open it (the name ‘sdd_crypt’ has to match what we pass to the zpool create command below – and again, substitute your actual device for /dev/sdd):

sudo cryptsetup luksOpen /dev/sdd sdd_crypt

Normally you would now create a regular file system, such as ext4, but we don’t do that.  Instead, create your zpool (we are calling ours ‘lxdzpool’ – feel free to change that to ‘tank’ or whatever pool name you prefer):

sudo zpool create -f -o ashift=12 -O normalization=formD -O atime=off -m none -R /mnt -O compression=lz4 lxdzpool  /dev/mapper/sdd_crypt
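
Before handing the pool to lxd, it is worth a quick sanity check that it is online and sitting on the encrypted mapper device:

sudo zpool status lxdzpool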

And there you have an encrypted zpool.  Add it to lxd using the standard ‘sudo lxd init’ procedure that you need to go through to create lxc containers, then start launching your containers and voila, you are using an encrypted zpool.

So, we are not done yet.  We can’t let the OS boot up without decrypting the zpool drive, lest our containers disappear and lxd falls back to using a directory for its storage, per the GitHub posting referred to above.  That would not be good.  So how do we make sure the drive is auto-decrypted at boot-up (which is needed for the lxc containers to launch)?

Well, we have to create a keyfile that is used to decrypt this drive after you decrypt the main OS drive (so you do still need to decrypt your PC at bootup as usual – as above):

sudo dd if=/dev/urandom of=/root/.keyfile bs=1024 count=4
sudo chmod 0400 /root/.keyfile
sudo cryptsetup luksAddKey /dev/sdd /root/.keyfile

This creates a keyfile at /root/.keyfile, which is used to decrypt the zpool drive.  Just answer the prompts that these commands generate (they are self-explanatory).

Now find out your disk’s UUID with:

sudo blkid

This should give you a list of your drives with various information.  We need the long string that comes after “UUID=…” for your drive, e.g.:

/dev/sdd: UUID="971bf7bc-43f2-4ce0-85aa-9c6437240ec5" TYPE="crypto_LUKS"

Note we need the UUID – not the PARTUUID or anything else.  It must say “UUID=…”.

Now edit /etc/crypttab as root:

sudo nano /etc/crypttab

And add an entry like this:

#Add entry to auto-unlock the encrypted drive at boot-up,
#after the main OS drive has been unlocked
sdd_crypt UUID=971bf7bc-43f2-4ce0-85aa-9c6437240ec5 /root/.keyfile luks,discard

And now reboot.  You should see your familiar boot-up screen for decrypting your Ubuntu OS.  Once you enter the correct password, the encrypted zfs zpool drive will be automatically decrypted, allowing lxd to access it as your zpool.  Here’s an excerpt from our ‘lxc info’ output AFTER a reboot – the most important bit for this tutorial is the storage.zfs_pool_name line:

$ lxc info
config:
  storage.zfs_pool_name: lxdzpool
api_extensions:
- id_map
- id_map_base
- resource_limits
api_status: stable
api_version: "1.0"
auth: trusted
auth_methods: []
public: false
driver: lxc
driver_version: 2.0.8
kernel: Linux
kernel_architecture: x86_64
kernel_version: 4.15.0-34-generic
server: lxd
storage: zfs

Note we are using our ‘lxdzpool’.

We hope this is useful.  GOOD LUCK!

Useful additional reference materials are here (or at least, they were here when we posted this article):

Encrypting a second hard drive on Ubuntu (post-install)

Setting up ZFS on LUKS

 

Another News Scare on Full Disk Encryption Hacking

Another day, another scary headline:

Security flaw in ‘nearly all’ modern PCs and Macs exposes encrypted data

Don’t get us wrong, we don’t discount this as false.  It’s almost certainly not.

But for us, we never ever rely on just one lock for our IT systems.  Full disk encryption?  Sure, we’ve got it.  But we also server-side encrypt our data AND we end-to-end encrypt our most important data.  Three levels of encryption, each with a completely different software package, all Open Source.

We also 2FA-protect our logins for all key accounts (email, ssh access, cloud and even our web site portal).

We note this headline, but then go about our day.

Don’t let the headlines scare you too much!

LXC Container Migration – WORKING

So we found a spare hour at a remote location and thought we could tinker a little more with lxc live migration as part of our LXD experiments.


We executed the following in a terminal as NON-ROOT users yet again:

lxc copy Nextcloud EI:Nextcloud-BAK-13-Sep-18

lxc start EI:Nextcloud-BAK-13-Sep-18

lxc list EI: | grep Nextcloud-BAK-13-Sep-18

And we got this at the terminal (a little time later…)

| Nextcloud-BAK-13-Sep-18 | RUNNING | 192.168.1.38 (eth0) | | PERSISTENT | 0 |

Note that this container weighs in at 138GB.  Not small by any standard.  It holds every single file that’s important to our business (server-side AND end-to-end encrypted, of course).  That’s a big copy, so even at LAN speed it gave us enough time to make some really good coffee!

So we then modified our front-end haproxy server to redirect traffic intended for our primary cloud-file server to this lxc instance instead (two minor changes to a config, replacing the IP address of the current cloud server with that of the new one).  Then we restarted our proxy server and….sharp intake of breath…

IT WORKED BEAUTIFULLY!

Almost unbelievably, our entire public-facing cloud server was now running on another machine (just a few feet away, as it happens).   We hoped for this, but we really did not expect a 138GB container to copy and start up first time.  #WOW

We need to test and work this instance to death to make sure it’s every bit as SOUND as our primary server – which is now back online, with this backup version just sleeping.

Note that this is a complete working copy of our entire cloud infrastructure – the Nextcloud software, every single file, all the HTTPS certs, databases, configurations, the OS – everything.  A user changes NOTHING to access this site; in fact, it’s not even possible for them to know it’s any different.

We think this is amazing, and it’s a great reflection of the abilities of lxc, which is why we are such big fans.

With this set-up, we could create working copies of our servers in another geo-location every month, say, or maybe even every week (once a day is too much for a geo-remote facility – 138GB for this one server over the internet?  Yikes).

So yes, the bandwidth needed IS significant, and you can’t flash the larger server images over the internet every day, but it does provide for a very resilient disaster-recovery posture: if our premises go up in a Tornado, we can be back online with just a few clicks from a web browser (change DNS settings and maybe a router setting or two) and a few commands from an ssh terminal connected to the backup facility.

We will develop a proper, sensible strategy for using this technique after we have tested it extensively, but for now, we are happy it works.  It gives us another level of redundancy for our updating and backup processes.

GOTTA LOVE LXD


An LXC Experiment for Backups – Take 2

Remember this recent article:  An LXC Experiment for Live Backups?

It was our first attempt to perform a live migration of an LXC container on one physical machine to a new container running on a completely different machine – the idea being to create live containers with very current, real-world data that can act as part of a complete disaster-resistant backup strategy.

The plan failed: the copy process gave us errors.  This may have been due to an SSH timeout (though it shouldn’t have been – the files were not THAT big for our LAN speed).  It could also have been due to trying to restore container CPU state on a different machine – maybe that’s too much for lxc.  It doesn’t really matter; it failed, so we had to rethink.

We have…a new plan.  What if we take a SNAPSHOT of a container and IMMEDIATELY copy that (in a ‘stopped’ state, of course)?  There are no CPU registers or memory to worry about as part of the copy – the container is in a re-startable form, not a running state.

Something like:

lxc snapshot Nextcloud Snapshot-name 

lxc copy Nextcloud/Snapshot-name NEWMACHINE:NextcloudMirror

lxc start NEWMACHINE:NextcloudMirror

This series of non-sudo user commands (that’s right, no scary super-user stuff again) snapshots the container ‘Nextcloud’ as ‘Snapshot-name’, copies that snapshot to a new lxc node called ‘NEWMACHINE’, and starts it there.  The node is an lxc remote system and can be anywhere: the same machine, the same network, or a different machine in a different country connected over the internet via an encrypted connection – all handled by lxc.
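
For completeness: ‘NEWMACHINE’ has to be registered as an lxc remote before these commands will work.  In case it helps, the pairing looks roughly like this (the address and trust password are placeholders) – first on the target machine, then on the source machine:

lxc config set core.https_address "[::]:8443"
lxc config set core.trust_password SOME-SECRET

lxc remote add NEWMACHINE 192.168.1.50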

Well we tried this…it WORKS.  Lots of testing to do, but we are very excited at this prospect.  More to come!

🙂

 

 

 

run-one

Our favorite new command-of-the-day:

run-one (found at Ubuntu’s Manpage – here)

Just made it easier to run a single instance of rsync for one of our routine server-to-server file copy jobs.
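
As an example of the kind of job we mean (the path and host below are placeholders, not our real jobs), a crontab entry like this makes sure a slow run can never pile up behind itself:

*/15 * * * * run-one rsync -a /srv/files/ backupserver:/srv/files/

If the previous rsync with those same arguments is still running, run-one simply declines to start another copy.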
