I’m happy to report that the big maintenance window yesterday was successful and we are now running live on the new server hardware! I’ve spent a lot of time earlier this week doing final preparations and planning the exact steps to take during the migration because I knew there was going to be a lot of work to do during the migration and many things that could go wrong. I felt a bit nervous going to bed on Friday evening before the big day, but I also knew I had done a lot of preparation so I still slept very well 🙂
Saturday morning started with some final preparations relating to the network setup. I had to disconnect both internet and my management connection to the new server to be able to then change it over to the final network configuration. I didn’t want to have to spend lots of time just trying to get back in to the server if anything went wrong so I set up an of out-of-bands connection to the server via a separate laptop for emergencies, then I rebooted the firewall on the new server to apply all the final changes and crossed my fingers I would still be able to log in after moving some network cables over to their new ports, and it all worked out on the first try!
The next step was to copy over a lot of data to the new server. I had tested this out with some less important virtual machines earlier so I knew roughly what to expect. I shut down the web server at 09:43 (according to twitter) and started the copy which took around 20 minutes. Here I also have to convert from vmdk format of disks used by ESX into the qcow2 format used by the KVM hypervisor. Things progressed pretty much as expected here and I continued with the mail services, directory services, ircd and lastly shut down the SSH/triton server.
Once all the services were down I could start copying all of the user home directories. I had prepared this by doing a ZFS snapshot and transferring it to the new server the day before. I did a final snapshot and then started an incremental “zfs send” operation to sync over any files that had changed in the last day. This seemed to work well, but then there was an error message for just one directory. I went to investigate and found that some of the directories did not have the snapshot that I thought I transferred the day before. This is exactly the kind of problem I did not want to run in to at this point 🙂 I knew doing a full copy of all the directories would take somewhere around two hours and I did not feel like sitting around waiting for that long so I devised a little script that would transfer just the missing directories over which was much faster, crisis averted! 🙂 At this point it’s around 12:31.
The next step is to actually disconnect the old server from the internet entirely and move over to the new server and firewall setup. This involved some more network patch cabling, lots of firewall rules and some routing. This went pretty well but there’s always some firewall rules that you miss, which sometimes requires doing some tcpdump work to figure out what’s actually going on. Anyway, around 14:01 things were starting to look pretty good in this area as well.
Next was lots of messing around with NFS exports, since I moved from a Solaris based OS called Nexenta to Linux the options for sharenfs had changed somewhat. Also more firewall rules.
Mail was the first service I wanted to get up and running so that’s what I started working on next. I know mail servers should try and resend mail for something like two days before giving up, but if I ran in to any problems I wanted to have as much time as possible to figure it out before any emails would get lost. This went well and I only had minor configurations to update to get it up and running. I also started up directory services and ircd which went pretty well, just some regular OS updates etc. Now it’s around 16:22.
Web services were next, here I had to do some more troubleshooting and I had actually forgotten to copy some data for the wiki so I had go back and move that over. At 18:23 web was back up and I was starting to feel pretty confident 🙂
Lastly I started up the new SSH shell server that I have been preparing for about a month. It has the same hostname and ssh-keys etc as the old triton server and I tried to replicate the environment as well so hopefully it’s not totally strange. It’s running Ubuntu linux instead of Gentoo as I mentioned in the previous post. Here I had actually misconfigured the primary IP-address and when I went to change it I messed up the NFS mounts which make things very weird I had a hard time ever shutting the server off because of hanging processes. Eventually I got back using the correct IP and I sent a message on twitter at 20:48 letting people know it was possible to log back in again. I took a well deserved break and had a a fancy hipster beer and watched some Netflix to relax 🙂
As far as I know things are mostly working fine on triton now, but there has been reports of some weird color garbage on the terminal after detaching from gnu screen (time to change to tmux?). I haven’t able to figure this one out so if you know what’s causing it please message me. Also two users had problems this morning that their /home got unmounted so I had to manually remount it. I’m not sure what was causing this but I’ll keep an eye on it. Other than that it’s mostly been some missing packages etc that I have been installing as we go.
There’s still a lot more to do before the move is complete but so far I’m very happy with things and I’m a bit less worried about the old server breaking down. Next on the agenda for me is getting blinkenbot and signup back up and running. There’s also work to get IPv6 back, and also lots of work on the back end infrastructure. Please let me know if you want to read more about the new setup, I’m thinking I should write more about how the final setup looks like now (or when it’s more finished).