Server maintenance done

The server maintenance is now done, and it took even longer than I guessed: 10+ hours of software and hardware upgrades, and battling with software bugs. And there will probably be some issues left to resolve in the coming days. 🙂

Please report bugs or strange behaviour to me and I’ll take a look as soon as I can!

Here is the story of my day upgrading the server. I ran into quite a few problems:

The hardware actually worked as expected: everything started up on the first try and all the memory was OK. Not even any dead hard drives! The software was a bit worse, though. First up were the ESXi upgrades. I had downloaded all the required patches the day before and was ready to start. Or so I thought. It turned out I had downloaded patches for ESX instead of ESXi, doh!

Not a huge problem, I thought, since downloading the first set of patches had only taken a couple of minutes. Today, however, VMware's support site was “temporarily down for maintenance” almost the entire day. I managed to get two of the three required patches (4.1 U1 + 2 patches) by googling for the direct links instead of going through the patch search tool, but I could not find the last patch. I had spent quite a few hours on this already, so I felt I had to start making some progress. Installing the patches went much more smoothly when they were for the correct OS! (And after I figured out that you apparently can’t use vihostupdate.pl from behind NAT.)

I started booting up the virtual machines, and everything looked good. The main SSH server, Triton, would not quite boot properly, but I could get a prompt halfway through the init process. I started looking around for anything odd, maybe some package with the wrong version. While updating the package list the machine froze, and in fact the entire ESXi host died with a “Purple Screen of Death”. After another PSOD or two I figured out it had something to do with network traffic, and Google turned up something about a bug affecting ESXi 4.1 U1 on VMs with multiple CPUs and the vmxnet3 driver. It turned out this bug had been fixed in the very patch I could not find earlier, frustrating! Many hours had passed by this time, and there was maybe an hour left until I was supposed to be finished with everything. Thankfully I managed to google the direct link for the last patch as well (VMware's site was still not working), and after the patch everything seemed to be working fine.
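For anyone wanting to see which of their own VMs have the same combination (more than one CPU plus a vmxnet3 network adapter), here is a rough Python sketch that reads the text of a VM's .vmx file. It assumes the usual `key = "value"` layout with the keys `numvcpus` and `ethernetN.virtualDev`, and that a missing `numvcpus` means one CPU. The function names are made up for illustration, and a match only means the VM has that combination, not that it will actually crash the host.

```python
import re

# Matches lines of the form: key = "value"
_VMX_LINE = re.compile(r'^\s*([^#=\s]+)\s*=\s*"(.*)"\s*$')


def parse_vmx(text):
    """Parse .vmx text into a dict with lowercased keys (hypothetical helper)."""
    settings = {}
    for line in text.splitlines():
        match = _VMX_LINE.match(line)
        if match:
            settings[match.group(1).lower()] = match.group(2)
    return settings


def uses_multi_cpu_vmxnet3(vmx_text):
    """Return True if the VM has more than one vCPU and at least one vmxnet3 NIC."""
    settings = parse_vmx(vmx_text)
    try:
        # Assumption: an absent numvcpus entry means a single vCPU.
        vcpus = int(settings.get("numvcpus", "1"))
    except ValueError:
        vcpus = 1
    has_vmxnet3 = any(
        re.fullmatch(r"ethernet\d+\.virtualdev", key) is not None
        and value.lower() == "vmxnet3"
        for key, value in settings.items()
    )
    return vcpus > 1 and has_vmxnet3


if __name__ == "__main__":
    import sys

    # Usage: python check_vmx.py /path/to/vm.vmx [more.vmx ...]
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as handle:
            flagged = uses_multi_cpu_vmxnet3(handle.read())
        print(f"{path}: {'multi-CPU + vmxnet3' if flagged else 'ok'}")
```

Running it against each .vmx file on a datastore gives a quick list of VMs worth a second look before patching.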

I also had some issues with upgrading NexentaStor: downloading the patches from the NexentaStor shell did not work properly. I managed to download the updates manually, though, and then ran the upgrade from within the NexentaStor shell. The second part of the upgrade was supposed to go from version 3.0.5 -> 3.1.0, but this did not work very well either. After some more googling it turned out that they had actually pulled version 3.1.0 from the servers, without putting up any message on the main site or anything that would give a hint when you tried upgrading from the CLI. Somewhere in the forums there was a post saying they were working on some distribution bug, so I could not get this upgrade at all.

Next up was a bug in the new Gentoo init system, baselayout2/openrc. It only affected me because I did not use a fairly recent kernel option (CONFIG_DEVTMPFS) that apparently improves boot performance or something like that. I didn’t get much of a clue from the software this time either; it just stopped about halfway through the init process and froze. Much googling and a kernel compilation later, it actually seemed like I was pretty close to finished, only a couple of hours after the deadline I had set 😛
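A check like this before rebooting would have saved some googling. The following is a minimal Python sketch, not something from the actual upgrade: it looks for CONFIG_DEVTMPFS in the kernel configuration text. The two paths it tries are assumptions about a typical setup: /proc/config.gz only exists if the running kernel was built with CONFIG_IKCONFIG_PROC, and /usr/src/linux/.config is the usual Gentoo source location, which may not match the running kernel.

```python
import gzip
from pathlib import Path
from typing import Optional


def kernel_option_enabled(config_text: str, option: str) -> bool:
    """Return True if `option` is set to y or m in kernel .config text."""
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set"
        if line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name == option:
            return value in ("y", "m")
    return False


def read_kernel_config() -> Optional[str]:
    """Try a couple of common locations for the kernel config (assumed paths)."""
    proc_config = Path("/proc/config.gz")
    if proc_config.exists():
        with gzip.open(proc_config, "rt") as handle:
            return handle.read()
    source_config = Path("/usr/src/linux/.config")
    if source_config.exists():
        return source_config.read_text()
    return None


if __name__ == "__main__":
    text = read_kernel_config()
    if text is None:
        print("No kernel config found in the assumed locations")
    elif kernel_option_enabled(text, "CONFIG_DEVTMPFS"):
        print("CONFIG_DEVTMPFS is enabled")
    else:
        print("CONFIG_DEVTMPFS is NOT enabled")
```

The parsing is kept separate from the file lookup so the same function can be pointed at the .config of a kernel that has been built but not booted yet.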

Then I had to fight a bit with grsecurity and the autostart scripts (root does not have read permission in users’ home directories, but that’s where the autostart scripts are stored, messy…), but at about 20:30 I could open up for logins to the main SSH server again!

I hope it will be at least a couple of years until I have to take down the server and do software maintenance like that again, phew. Sorry for the longest blog post I’ve ever written, but maybe someone will find it interesting 😛

2 Responses to Server maintenance done

  1. Nik says:

    Thanks for providing such a great service! I don’t know what I would do without it, you make everything so simple! I really appreciate your effort (and hassles) to keep it going.

  2. dabeowulf says:

    I appreciate this post. The effort it takes to run such a service, and insights into the inner workings of how the specifics get accomplished, always make for an interesting read.
