It’s been a week since the big maintenance window and everything seems to be working pretty well so far! I’ve been installing missing packages and fixing the little helper scripts (scr, shell-setup etc.) that people noticed were missing on the main shell server, but mostly I’ve been doing more “behind the scenes” work. Part of that was converting the signup program and the blinkenbot plugin (which integrate with each other) from python2 to python3. Thankfully the actual conversion was very quick (mostly done using “2to3”), but the change introduced some new bugs that took a while to track down and fix. I also switched to using a python venv and fixed up a minimal setup.py so it installs as a package, which makes things a bit cleaner. It was also time to update a few of the questions in the quiz 😉 I worked out a few bugs this morning and the first two new user accounts created after the maintenance have been added!
I’ve also spent some time setting up new monitoring systems. Previously I was using nagios for host and service checks and munin for metrics. Both of these systems are really quite old, so I have now switched to something much more modern: nagios4! 😛 Okok, I know people like to hate on nagios, but it’s actually pretty useful for some cases! From now on I’ll only be using it to monitor public-facing services (checking that sshd, web etc. are responding), and the checks will go over the internet to test the whole path through firewalls etc. In addition to this I’m using telegraf/influxdb/grafana for host monitoring and metrics (drawing graphs etc.). I’m sure this is also a controversial choice and some people think I should be using prometheus or some new software I’ve never even heard of, and it’s not out of the question that I will switch in the future. The setup was a bit time-consuming and not as straightforward as I would have thought, but it’s in a working state now at least, even though there’s still a lot more stuff to add.
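As a rough illustration of the kind of “is it responding” check described above, here is a minimal Python sketch of a nagios-style plugin. This is not the actual check running here (nagios already comes with plugins like check_tcp and check_ssh for this); the names `check_tcp` and `main` and the 2 second warning threshold are just made up for the example. The only real convention it relies on is the standard plugin exit codes: 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN.

```python
import socket
import time

# Standard nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_tcp(host, port, warn_seconds=2.0, timeout=10.0):
    """Try a TCP connect to host:port and return (exit_code, message).

    warn_seconds and timeout are illustrative defaults, not values
    taken from any real configuration.
    """
    start = time.monotonic()
    try:
        # A successful connect means something is listening and the
        # whole network path (firewalls etc.) lets the traffic through.
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - start
    except OSError as exc:
        # Covers refused connections, timeouts and DNS failures.
        return CRITICAL, f"CRITICAL - {host}:{port} not responding ({exc})"

    if elapsed > warn_seconds:
        return WARNING, f"WARNING - {host}:{port} slow, {elapsed:.2f}s"
    return OK, f"OK - {host}:{port} responded in {elapsed:.2f}s"


def main(argv):
    """Parse 'host port' arguments, print one status line, return the exit code."""
    if len(argv) != 2 or not argv[1].isdigit():
        print("UNKNOWN - usage: check.py HOST PORT")
        return UNKNOWN
    code, message = check_tcp(argv[0], int(argv[1]))
    print(message)
    return code
```

To use it as a real plugin you would end the script with `sys.exit(main(sys.argv[1:]))`, since nagios reads the exit code and the first line of output. Running it from a machine outside the network is what makes it test the whole path rather than just the service itself.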
Another thing I’ve been spending a lot of time on is redundancy for part of the internet connections using OpenBSD carp and openbgpd. It’s not quite finished yet and I have not tested a failover, but I think the concept is there at least. I might do a separate post on this if someone finds it interesting. I’m also planning a more technical post about some fail2ban setup stuff 🙂