Wednesday, December 16, 2009
This Shit Is Killing Me
As you may have realized by now, the glowing reports about all being fine with the kernel upgrade on BOT House were premature.
BH is now choked. It's not taking new connections. Oddly enough the old connections are working fine. Odder still, anything that passes through BH (which, among other things, is the firewall here on DinkNet) is hunky-dory.
In fact I'm passing through BH right now to blog about this happy horseshit.
BH survived the initial upgrade for about 36 hours. When it died Monday morning (when I was absolutely unprepared to do anything about it), everything else on the network was humming along fine. I bounced it and things were, again, fine. Sometime during the day it choked again, but my connections from work to home stayed up. But I could do nothing on BH remotely.
This really pissed me off, but it made me all the more determined to figure out WHAT THE FUCKING FUCK IS WRONG WITH THIS FUCKING BOX.
Monday night when I bounced it again I started to simplify the firewall rules. And I installed conntrackd to do some statistics on the firewall.
Tuesday morning, all was well. I zipped off to work, nearly getting killed in the process (long story - almost got run off the road but I swerved clear and the car that almost hit me hit someone else and they both ended up crashing into the restraining wall), sat down at my desk, and at about 10AM everything died again.
Plus, my connection from work went with it. I could not reconnect, but, again, everything going through BH was fine, even new connections. It was starting to look more and more like a firewall issue.
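For what it's worth, the "new connections to BH fail, new connections through BH work" symptom is easy to check in a repeatable way. Here is a rough Python sketch of that kind of probe. The function names are made up for illustration, and any hosts and ports you feed it are placeholders, not the actual DinkNet layout.

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a brand-new TCP connection to host:port succeeds in time."""
    try:
        # create_connection does the full TCP handshake, so an existing
        # long-lived session staying up does not affect the result.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

def probe(targets, timeout=3.0):
    """Map each (label, host, port) tuple to True/False."""
    return {label: can_connect(host, port, timeout)
            for label, host, port in targets}
```

Calling probe() with one target on the firewall itself (say, its ssh port) and one on a machine behind it would show the split described above: the first comes back False while the second stays True.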
So Tuesday evening I took a closer look at everything. I shut down the proxy server on BH, which left ssh and nfs as the only services on the box (besides UT, that is).
I played a few EXCELLENT rounds of UT on BH (people started hitting it as soon as it was back up) and hit the sack at about 11:30PM.
I woke up the next morning (Wednesday) to find the box fucked again. It appeared everything choked right after midnight.
In fact it was becoming clear that every time it choked, it was at xx:02 AM or PM, which is meaningful since that is when the proxy project box does all of its dirty work (this particular system continues to crank away while BH is down, BTW).
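A pattern like "always at two minutes past the hour" is the sort of thing worth confirming with a quick tally instead of eyeballing logs. A minimal sketch, assuming you have already pulled the time of each hang out of the logs as "YYYY-MM-DD HH:MM" strings (that format and the function names are assumptions, not anything conntrackd emits by itself):

```python
from collections import Counter
from datetime import datetime

def minute_histogram(timestamps, fmt="%Y-%m-%d %H:%M"):
    """Count how many timestamps land on each minute-past-the-hour."""
    return Counter(datetime.strptime(ts, fmt).minute for ts in timestamps)

def dominant_minute(timestamps, fmt="%Y-%m-%d %H:%M"):
    """Return (minute, share of all events) for the most common minute."""
    hist = minute_histogram(timestamps, fmt)
    if not hist:
        return None  # nothing to analyze
    minute, count = hist.most_common(1)[0]
    return minute, count / len(timestamps)
```

If nearly every hang shares the same minute, the culprit is probably something scheduled, which is what points the finger at the proxy project box here.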
So I bounced the box, went off to work, and the thing died once again on the hour of 10AM plus change.
This time around I had set up an alternate, pass-through ssh connection so I wouldn't get locked out like I did on Tuesday. It tunnels through BH directly to the box running EXP IV, which still shows no adverse reaction to the same kernel upgrade (apples to oranges? Same everything except it's a 64-bit AMD dual core and BH is a 32-bit Intel single core... hmmm...).
So that's where we were at Wednesday. Down.
When I got home Wednesday, I noted the time of the last conntrack log (once again "on the hour"), rebooted, and sat down to generate another 2.6.32 kernel image, which takes about two hours with the Debian make-kpkg tool.
This time around I took out SMP (Symmetric Multiprocessing) and Hyperthreading support (actually, hyperthreading simply disappeared as an option after SMP was removed). It is a single CPU box, after all, but it is a P4 and SMP support never seemed to matter in kernels past. While it was cranking away at the code I hit BH to see how the build affected UT performance, since building the kernel was chewing up most of the CPU cycles. No problems there.
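After a two-hour build it is worth double-checking that the option really did get turned off. A small sketch that reads a kernel .config and reports one option; kernel configs write a disabled option as "# CONFIG_FOO is not set", which is what this looks for. The function name is hypothetical, and CONFIG_SCHED_SMT is only a likely candidate for the hyperthreading option mentioned above, not something confirmed here.

```python
def config_option(config_text, name):
    """Return the option's value ('y', 'm', or a literal), or None if unset/absent."""
    for line in config_text.splitlines():
        line = line.strip()
        if line == f"# {name} is not set":
            return None  # explicitly disabled
        if line.startswith(f"{name}="):
            return line.split("=", 1)[1]
    return None  # option never mentioned in this config
```

Feed it the text of the config file from the build tree (or the copy installed under /boot, if there is one) and ask for "CONFIG_SMP". A result of None means the new image really is a uniprocessor kernel.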
Once the kernel & modules were built and installed, I had to rebuild iptables, ipset, conntrackd support and tools/libs, and reboot one more time.
Now, on Thursday (12/17/2009), BH has been running for a little less than twelve hours. Whether it will keep running is anyone's guess. I am optimistic that removing SMP was the way to go, since EXP IV, a true multicore system, has had zero problems with this kernel version.
And now I can get back to my other projects, like messing around with the TOO MANY MODS Windows UT99 server (I will be taking requests, so if you have some favorite maps or other UT99 extensions, let me know).
One slightly positive outcome of all this is that I leveraged a private (at the moment) Websense hack to get the pass-through connection back to EXP IV to work through the corporate proxy. It's an extremely small and elegant hack for completely bypassing monitoring and filtering that I've been working on for a few months now. I have contacted Websense but they seem to be ignoring me. It may be a unique flaw in our environment (I suspect it could be the Microsoft ISA servers), but my testing facilities are limited. We have a Websense upgrade scheduled for January and if the hack survives the upgrade I plan on shoving it up Websense's ass.
Or, I may just keep it in my private toolbox.