Ubuntu 16.10 LXC host on ZFS Root, with EFI and Time Machine

25 Apr 2017

If you're here you have probably already found this document: Ubuntu 16.10 Root on ZFS. This is a modified and expanded version of it, intended as a step-by-step guide to installing a root system on ZFS that boots from a modern EFI system.

Once that's installed, I show how to set up LXC so it plays nicely with ZFS, and create a VM that serves as a Time Machine destination. We use this exact setup so the Macs in the office can back up their files over the network to a proper computer, with proper storage, rather than a slow, crufty and poorly supported NAS running insecure or out-of-date software. There are some caveats:

  • These steps are for EFI boot only - I've finally made the jump away from MBR
  • I presume a RAID-Z install to new disks (4 x SSD in my case). RAID-Z2 is almost identical.
  • Root system is absolutely minimal and designed to run LXC and nothing else
  • I prefer dynamic IPs and Zeroconf announcement of host names, but this aspect is easy enough to skip if you prefer it old-school

Part 1: Installing Ubuntu Root to ZFS

Download the Ubuntu 16.10 Live CD and start it up. Open a Terminal (Ctrl-Alt-T), and if you are connecting remotely to run the setup, run apt --yes install openssh-server then log in remotely with ssh ubuntu@ubuntu.local. Then run the commands below:

    sudo -s
    apt-add-repository universe
    apt update
    apt --yes install debootstrap gdisk zfs-initramfs
    

Now we partition the disks. You can do this using /dev/disk/by-id/XXX as in the original guide, although if you have a lot of disks it's easy to lose track. At this point it's simpler to ls -l /dev/sd? to see which disks are in your system, identify which ones you're installing to, and run the following commands for each one of those disks:

    sgdisk -n9:-9M:0 -t9:BF07 /dev/sdX
    sgdisk -n3:1M:+512M -t3:EF00 /dev/sdX
    sgdisk -n1:0:0 -t1:BF01 /dev/sdX
    

I found I had to run the above in that order. The first command creates an 8MB "Solaris Reserved" partition at the end of the disk, for a reason I have yet to determine beyond "that's what Solaris does". The second creates a 512MB EFI partition at the start of the disk, and the third creates the primary ZFS partition on the rest of the disk.
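
If you have several disks to do, a small shell loop saves some typing. This is just a sketch assuming the target disks are sda to sdd; adjust the list to match your own system:

    for DISK in /dev/sda /dev/sdb /dev/sdc /dev/sdd ; do
        sgdisk -n9:-9M:0 -t9:BF07 $DISK     # 8MB Solaris Reserved at the end
        sgdisk -n3:1M:+512M -t3:EF00 $DISK  # 512MB EFI partition at the start
        sgdisk -n1:0:0 -t1:BF01 $DISK       # ZFS partition on the rest
    done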

Now create the zpool. For this command, you should definitely use the full /dev/disk/by-id/NNN path, not the /dev/sdX equivalent. If you've ever pulled all the disks out of a RAID array then put them back in a different order, you'll understand why this is recommended.

You will be adding partition 1 of each of the newly-partitioned disks, so the path for each will end with "part1". I am creating this on 4 x Samsung 850 EVO SSDs, so I will be specifying 8KB blocks by way of ashift=13. Then create a single ZFS filesystem tank/root for the root system. The original guide partitions root into sections for var, home etc., which (given this is building a simple LXC host) is unnecessary complexity.

    zpool create -o ashift=13 -O atime=off -O canmount=off -O normalization=formD -O mountpoint=/ -R /mnt tank raidz /dev/disk/by-id/ata*Samsung_SSD_850*part1
    zfs set xattr=sa tank   # See https://github.com/zfsonlinux/zfs/issues/443
    zfs create -o canmount=noauto -o mountpoint=/ tank/root
    zfs mount tank/root
    

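Before going further, it's worth a quick sanity check that the pool and root filesystem look the way you expect (optional, but cheap):

    zpool status tank
    zfs list -o name,mountpoint,canmount -r tank
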
The new ZFS filesystem is now mounted on /mnt. Install your OS of choice. I am using Ubuntu Yakkety rather than Xenial LTS because, at the time of writing, Yakkety has a patch applied for a ZFS/GRUB2 bug which is not in Xenial.

    debootstrap yakkety /mnt
    mount --rbind /dev /mnt/dev
    mount --rbind /proc /mnt/proc
    mount --rbind /sys /mnt/sys
    chroot /mnt
    ln -s /proc/self/mounts /etc/mtab
    

You're now in the new environment. Install what you need to, but I would suggest the minimum for now. I set the hostname, timezone, locale, network, create a user and install a minimal list of packages.

    echo myhostname > /etc/hostname
    echo 127.0.1.1 myhostname >> /etc/hosts
    locale-gen en_US.UTF-8
    dpkg-reconfigure tzdata
    cat <<EOF > /etc/apt/sources.list
    deb http://archive.ubuntu.com/ubuntu yakkety main universe
    deb-src http://archive.ubuntu.com/ubuntu yakkety main universe

    deb http://security.ubuntu.com/ubuntu yakkety-security main universe
    deb-src http://security.ubuntu.com/ubuntu yakkety-security main universe

    deb http://archive.ubuntu.com/ubuntu yakkety-updates main universe
    deb-src http://archive.ubuntu.com/ubuntu yakkety-updates main universe
    EOF

    apt update
    apt --yes --no-install-recommends install linux-image-generic
    apt --yes install zfs-initramfs openssh-server nfs-kernel-server gdisk dosfstools grub-efi-amd64 # essential
    apt --yes install zsh vim screen psmisc avahi-utils # your favourites here
    useradd -m -U -G sudo -s /bin/bash mike
    passwd mike
    

Now we come to the part where we try to fix the daft interface names given to us by systemd, which has helpfully fixed the problem of predictable interface names in an unpredictable order by giving us unpredictable names in a predictable order. You can do this with a boot parameter, but I prefer to rename them with udev rules so they're useful AND predictable. Identify your network interfaces with ifconfig -a, note the MAC addresses, then set up a udev rule to give them the names you want. Also edit /etc/network/interfaces to ensure the interface comes up on boot:

    cat <<EOF > /etc/udev/rules.d/70-net.rules
    SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="XX:XX:XX:XX:XX:XX", NAME="eth0"
    EOF

    cat <<EOF > /etc/network/interfaces
    source-directory /etc/network/interfaces.d

    auto eth0
    iface eth0 inet dhcp
    EOF
    

Next, an optional step: if at some point you will share any of the ZFS filesystems over NFS, be sure to edit /etc/default/zfs to set ZFS_SHARE='yes'.
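
If you want to make that change non-interactively, a sed one-liner along these lines does it. This assumes the stock file already contains a ZFS_SHARE line; check by hand if it doesn't:

    sed -i'' "s/^ZFS_SHARE=.*/ZFS_SHARE='yes'/" /etc/default/zfs
    grep ZFS_SHARE /etc/default/zfs   # confirm the change took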

Now we need to install the bootloader. This was my first crack at EFI booting but it's pretty simple to set up. You need a FAT partition to store the EFI files, which you'll mount at /boot/efi. This will be done identically on each disk, which means you can pull any disk out of the array and it will still boot. I tested this, and so should you. Unlike the original guide I am referring to the partition by its label, which makes the process a bit simpler: the fstab entry will mount the first partition found with a label of "EFI". As they're all identical, we don't care which one it is. Run the following:

    mkdir /boot/efi
    echo LABEL=EFI /boot/efi vfat defaults 0 1 >> /etc/fstab
    

Then repeat the following steps for each disk in your pool. You can reference them by /dev/sdX here, as I have for brevity:

    mkdosfs -F 32 -n EFI /dev/sdX3
    mount /dev/sdX3 /boot/efi
    grub-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=ubuntu --recheck --no-floppy
    umount /boot/efi
    

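As with the partitioning, those per-disk steps can be wrapped in a loop if you prefer. Again just a sketch, assuming the same sda to sdd disks as before (adjust to yours):

    for DISK in /dev/sda /dev/sdb /dev/sdc /dev/sdd ; do
        mkdosfs -F 32 -n EFI ${DISK}3
        mount ${DISK}3 /boot/efi
        grub-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=ubuntu --recheck --no-floppy
        umount /boot/efi
    done
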
Finally we come to configuring GRUB. I first edited /etc/default/grub to remove GRUB_HIDDEN_TIMEOUT, enable GRUB_TERMINAL, remove "quiet splash" from the kernel arguments and add "ipv6.disable=1" to disable IPv6 (which I'm not familiar enough with to secure to my satisfaction). These steps are useful for a server, but all entirely optional. To update GRUB, snapshot the root FS (a nice idea from the original guide) and exit our chroot environment:

    update-initramfs -c -k all
    update-grub
    zfs snapshot tank/root@install
    exit
    

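For reference, the /etc/default/grub edits described above left the relevant lines of my file looking something like this (a sketch: GRUB_HIDDEN_TIMEOUT is simply deleted, and anything not shown stays at its default):

    GRUB_TERMINAL=console                   # show the boot menu on the console
    GRUB_CMDLINE_LINUX_DEFAULT=""           # "quiet splash" removed
    GRUB_CMDLINE_LINUX="ipv6.disable=1"     # disable IPv6
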
We're now back to our live-CD environment. Unmount cleanly and reboot:

    mount | grep -v zfs | tac | awk '/\/mnt/ {print $3}' | xargs -i{} umount -lf {}
    zpool export tank
    reboot
    

...and relax. Make a cup of tea. Some things to note at this stage:

  1. As you're booting with EFI not BIOS, you might want to configure your firmware to "Disable CSM", or "Prioritise UEFI over Legacy" or whatever the appropriate setting is.
  2. If for any reason your zpool failed to export cleanly - perhaps you forgot to export it, or got impatient and hit the reset button - then your machine will boot to an "initramfs" environment. Simply type zpool import tank and then reboot to get through this.
  3. Assuming your newly-installed OS boots, now is the right time to verify that it will boot off any disk in the pool. Remove a disk and run sudo zpool status to verify the pool is marked as degraded:
    # zpool status
      pool: tank
     state: DEGRADED
    status: One or more devices could not be used because the label is missing or
            invalid.  Sufficient replicas exist for the pool to continue
            functioning in a degraded state.
    action: Replace the device using 'zpool replace'.
       see: http://zfsonlinux.org/msg/ZFS-8000-4J
      scan: resilvered 920K in 0h0m with 0 errors on Mon Apr 24 12:58:30 2017
    config:
    
            NAME                                                     STATE     READ WRITE CKSUM
            tank                                                     DEGRADED     0     0     0
              raidz1-0                                               DEGRADED     0     0     0
                16332688028184785458                                 UNAVAIL      0     0     0  was /dev/disk/by-id/ata-Samsung_SSD_840_PRO_Series_S12SNEACB00821N-part1
                ata-Samsung_SSD_850_EVO_500GB_S2RBNX0H775160V-part1  ONLINE       0     0     0
                ata-Samsung_SSD_850_EVO_500GB_S2RBNX0H775161T-part1  ONLINE       0     0     0
                ata-Samsung_SSD_850_EVO_500GB_S2RBNX0H822928B-part1  ONLINE       0     0     0
        
    Reboot, and confirm the OS comes up nicely. Once it's up, run sudo zpool online tank /dev/disk/by-id/NNN to bring back the disk that was identified as missing. Rejoice that with RAID-Z this resilvers in only a few seconds, and confirm with zpool status that the array is complete again. Then repeat with the other disks in your pool until you are satisfied.

Part 2: LXC on ZFS

We now have a very minimal system on a ZFS root, and I prefer to keep it that way. Anything beyond the basics I install in a virtual machine - the advantages of this are many, but for me the biggest is that I can leave the "root" system essentially untouched. Change introduces instability, and I don't like rebooting things. It's not just application software; I run a VPN in one virtual machine, filesharing in another (see below). The best part is that LXC and ZFS go together very nicely, although I find a few tweaks to Ubuntu's out-of-the-box setup are required to make this go smoothly. First, the basics:

    apt upgrade
    apt --yes install lxc lxcfs
    zfs create tank/lxc
    

Installing "lxc" will cause a bridge called "lxbr0" to be dynamically created, which I dislike - I prefer to define the bridge manually in /etc/network/interfaces. This involves the following steps: you can use a static IP here if you prefer. I also like to set the MAC address template for my virtual machines to something I know, so I can easily recognise them on the LAN. Finally, I want the default storage for LXC hosts to be in ZFS under the "tank/lxc":

    ip link set lxcbr0 down

    cat <<EOF > /etc/network/interfaces
    source-directory /etc/network/interfaces.d

    auto br0
    iface br0 inet dhcp
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0
    EOF

    cat <<EOF > /etc/lxc/default.conf
    lxc.network.type = veth
    lxc.network.link = br0
    lxc.network.flags = up
    lxc.network.hwaddr = 00:00:00:ff:ff:xx
    EOF

    cat <<EOF > /etc/lxc/lxc.conf
    lxc.lxcpath = /var/lib/lxc
    lxc.bdev.zfs.root = tank/lxc
    EOF

    sed -i'' 's/USE_LXC_BRIDGE="true"/USE_LXC_BRIDGE="false"/' /etc/default/lxc-net
    sed -i'' '/rlimit-nproc=3/d' /etc/avahi/avahi-daemon.conf
    

The last line works around a long-standing bug in the way Avahi is set up, at least on Debian-based systems, when virtual machines are in use. You'll have to do this on the guest systems too, as shown below, although obviously only if you've installed Avahi.

Reboot and verify the br0 bridge comes up. Now you can create your LXC VMs as you like. For example, to create a brand new VM running Xenial:

    lxc-create --template ubuntu --name $GUESTVM -- --release xenial
    

This will create a new VM with the configuration files stored under /var/lib/lxc/$GUESTVM, and a new ZFS filesystem called tank/lxc/$GUESTVM which is mounted under /var/lib/lxc/$GUESTVM/rootfs. I find this works nicely for me. To configure the new VM for remote access so I can run ssh $GUESTVM.local, I run the following:

    lxc-start --name $GUESTVM
    lxc-attach --name $GUESTVM
    apt --yes install openssh-server avahi-utils
    useradd -m -U -G sudo -s /bin/bash mike
    passwd mike
    sed -i'' '/rlimit-nproc=3/d' /etc/avahi/avahi-daemon.conf
    systemctl restart avahi-daemon
    userdel -r ubuntu
    exit
    

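Back on the host, it's worth a quick check that the container is running and that its ZFS dataset landed where we expect (optional):

    lxc-ls --fancy          # the new VM should be listed as RUNNING with an IP
    zfs list -r tank/lxc    # and its rootfs should be a dataset under tank/lxc
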
Another useful thing to do is to import a shared folder of some sort from the host system. The final part of this guide will show how to set up a Virtual Machine running Netatalk, which I will use for Time Machine backups for the Mac machines in the office.

Part 3: Time Machine backups to ZFS from a Virtual Machine

Here's where it gets fun. The Netatalk supplied with Ubuntu is hopelessly outdated, so you have to build it from source. But why install (maintain, pollute...) all the build packages for this in your root system? Create a VM to build and run it. I have a VM for every major application we run at the office - this is broadly the same sort of vision as Docker has, although I never got on with their implementation.

First we need to set up ZFS as a target for Time Machine. For the sake of argument I'll be mounting mine under /home/timemachine/$USER. Create a share for each user, as below, and set the quota. Time Machine requires a quota, as it will otherwise expand to fill all the space available.

    zfs create -o mountpoint=/home/timemachine tank/timemachine
    zfs create -o quota=100G tank/timemachine/mike
    chown -R mike:mike /home/timemachine/mike
    

Now we need to create a new VM to run Netatalk, and we need it to have access to these folders. Here's how you can do this from scratch using Xenial as the guest OS. I'm mounting /home/timemachine at the same location in the VM and setting the VM to autostart. Use "rbind" rather than "bind" to ensure the ZFS filesystems for each user are available too. As usual, I'm installing openssh, avahi and adding a user. Installing Avahi is mandatory for Time Machine.

    lxc-create --template ubuntu --name timemachine -- --release xenial
    cd /var/lib/lxc/timemachine
    echo "lxc.mount.entry = /home/timemachine home/timemachine none rbind 0 0" >> config
    echo "lxc.start.auto = 1" >> config
    mkdir rootfs/home/timemachine
    lxc-start --name timemachine
    lxc-attach --name timemachine
    apt --yes install openssh-server avahi-utils
    useradd -m -U -G sudo -s /bin/bash mike
    passwd mike
    sed -i'' '/rlimit-nproc=3/d' /etc/avahi/avahi-daemon.conf
    systemctl restart avahi-daemon
    userdel -r ubuntu
    

The next steps are to build Netatalk, which at the time of writing is at version 3.1.11. These steps are largely unmodified from the Netatalk wiki, although I have dropped the "tracker" package as it (a) appears to require a package that depends on "gnome" to run, and (b) is only used for Spotlight, which we don't want for Time Machine (and in our office we choose not to use for our other shares either). I've also set the install folders to be a bit more Linux and a bit less BSD. Run the following as root:

    apt --yes install build-essential libevent-dev libssl-dev libgcrypt-dev libkrb5-dev libpam0g-dev libwrap0-dev libdb-dev libtdb-dev libmysqlclient-dev avahi-daemon libavahi-client-dev libacl1-dev libldap2-dev libcrack2-dev systemtap-sdt-dev libdbus-1-dev libdbus-glib-1-dev libglib2.0-dev libio-socket-inet6-perl wget
    wget https://downloads.sourceforge.net/project/netatalk/netatalk/3.1.11/netatalk-3.1.11.tar.bz2
    tar jxvf netatalk-3.1.11.tar.bz2
    cd netatalk-3.1.11
    ./configure --with-init-style=debian-systemd --without-libevent --without-tdb --with-cracklib --enable-krbV-uam --with-pam-confdir=/etc/pam.d --with-dbus-daemon=/usr/bin/dbus-daemon --with-dbus-sysconf-dir=/etc/dbus-1/system.d --sysconfdir=/etc --localstatedir=/var --prefix=/usr
    make
    make install
    systemctl enable avahi-daemon
    systemctl enable netatalk
    

That's it. Configuration of Netatalk 3.x is a breeze compared to 2.x: there's only one file, which (thanks to the flags you passed to configure, above) is at /etc/afp.conf. Here's what mine looks like:

    [Global]
      spotlight = no

    [Time Machine]
      path = /home/timemachine/$u
      time machine = yes
    

That's really all there is to it. Ensure you have accounts for any users that will be using this for backup, ensure they have passwords set, ensure their UIDs are the same as on the host system, and ensure that they own their respective /home/timemachine/$USER folder.

The $u in the configuration file is expanded to be the name of the current user, which means that "mike" will get the folder /home/timemachine/mike as a Time Machine destination. This way each user only has access to their own backups, and they get their own ZFS quotas too.
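
For each additional backup user that boils down to something like the following, run inside the timemachine VM. This is only a sketch: "anna" and UID 1001 are made-up examples, the UID must match that user's UID on the host, and a matching tank/timemachine/anna filesystem is created on the host exactly as we did for mike above:

    useradd -m -u 1001 -U -s /bin/bash anna   # hypothetical user; UID must match the host
    passwd anna
    chown anna:anna /home/timemachine/anna    # each user must own their backup folder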

There is one more critical step required due to a bug which exists in the current release of Netatalk. Given that I only filed this report about two hours ago it's not fixed yet, but it might be by the time you read this. Until then, you'll need a simple workaround.

    mkdir '/home/timemachine/$u'
    

Then run reboot on your virtual machine and you should be able to back up to the "Time Machine" share on the "timemachine" host.
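
If the share doesn't show up on the Macs, a quick way to confirm that Avahi is actually advertising the AFP service is to browse for it from any machine on the LAN that has avahi-utils installed:

    avahi-browse -rt _afpovertcp._tcp   # the "timemachine" host should appear in the list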

Bonus: backup script

The joys of ZFS continue here, as backing this system up is pretty simple. I've got a large USB disk which I've formatted as EXT2 and given the label "backup" (actually I have two with this label, which I rotate). With this script in /etc/cron.daily all the shares that don't have the user property "backup:skip" set are backed up to that drive every night.

That means I can set that property by running zfs set backup:skip=true tank/NNN on any filesystem to prevent it being backed up. Of course you can reverse this test as you see fit.
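
To see at a glance which filesystems are currently marked to be skipped, you can list the property across the pool:

    zfs get -r -t filesystem backup:skip tank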

    #!/bin/sh

    # Backup all ZFS shares that don't have the "backup:skip" user property
    # set to true. Backs these up to $DESTDIR, which we assume to be set
    # in /etc/fstab as the mountpoint for an external disk. The disk
    # will be mounted if it is not already, and unmounted at the end.

    DESTDIR=/backup
    TMP=/tmp/mnt.$$

    mkdir "$TMP"
    echo `date +'%Y-%m-%d %H:%M:%S'` - Backing up to $DESTDIR
    zpool status
    grep -q " $DESTDIR " /proc/mounts || mount "$DESTDIR"
    grep -q " $DESTDIR " /proc/mounts || { echo "$DESTDIR not mounted" ; exit 1 ; }
    ALL=`zfs list -H -obackup:skip,name -tfilesystem | grep -v '^true' | cut -f 1 --complement`

    for FS in $ALL ; do
        # snapshot the filesystem, mount the snapshot read-only, rsync it, then clean up
        zfs snapshot "$FS@backup"
        mount -o ro -t zfs "$FS@backup" "$TMP"
        mkdir -p "$DESTDIR/$FS"
        rsync -ax "$TMP/" "$DESTDIR/$FS"
        echo `date +'%Y-%m-%d %H:%M:%S'` - Completed $FS
        umount "$TMP"
        zfs destroy "$FS@backup"
    done

    rmdir "$TMP"
    umount "$DESTDIR"
    echo `date +'%Y-%m-%d %H:%M:%S'` - Backup complete
    zpool scrub tank

Notes

2018-09-04: I have finally upgraded the OS from Yakkety (16.10) to Zesty, Zesty to Artful and Artful to Bionic (18.04 LTS), using the approach outlined at https://andreas.scherbaum.la/blog/archives/950-Upgrade-from-Ubuntu-16.10-yakkety-to-17.10-artful.html. This worked with no issues; however, on the final upgrade from 17.10 to 18.04 LTS I did have to change /etc/default/grub as shown:
    # GRUB_HIDDEN_TIMEOUT=0            # removed
    # GRUB_HIDDEN_TIMEOUT_QUIET=true   # removed
    GRUB_TIMEOUT_STYLE=countdown       # added
    GRUB_DISABLE_OS_PROBER=true        # added
    
then re-ran update-grub before rebooting. OS probing can safely be disabled on a single-OS system.
I also had to modify my LXC setup. Previously my config files looked something like this:
    lxc.include = /usr/share/lxc/config/ubuntu.common.conf
    lxc.rootfs.path = /var/lib/lxc/myvm/rootfs
    lxc.uts.name = myvm
    lxc.arch = amd64
    lxc.mount = /var/lib/lxc/myvm/fstab
    lxc.network.type = veth
    lxc.network.link = br0
    lxc.network.flags = up
    lxc.network.hwaddr = 00:00:00:bf:0d:1f
    
I had to change them to:
    lxc.include = /usr/share/lxc/config/common.conf
    lxc.rootfs.path = /var/lib/lxc/myvm/rootfs
    lxc.uts.name = myvm
    lxc.arch = amd64
    lxc.mount.fstab = /var/lib/lxc/myvm/fstab
    lxc.net.0.type = veth
    lxc.net.0.link = br0
    lxc.net.0.flags = up
    lxc.net.0.hwaddr = 00:00:00:bf:0d:1f
    lxc.apparmor.profile = unconfined
    
I'm sure AppArmor security is a good thing in theory, but it prevented me from running any VMs, with no explanation of why. It's the security equivalent of a lock on my front door that I don't have the key for.