Cascading Failure, Technical Debt, and Punching a House with my Face

At 11:32 PM on Saturday, May 11th, I got an email from MX Toolbox notifying me that an SBS 2008 machine I support had gone unresponsive. It’s 600 miles away from me in another state. This was not a strange occurrence with this server.

A Cluster of Prior Failures

Five years ago a small office with a minimal budget needed an SBS implementation. I recommended an HP ML 115 G5 with four hard drives and onboard RAID provided by an NVIDIA chipset. I have regretted that decision for all five years. Here’s a post of mine concerning that chipset and the troubles I’ve had with it.

In short, I have poor insight into and control over the entire server’s health. Some examples include:

  • I couldn’t update the hard drives’ firmware, which was a big deal because those drives’ serial numbers fell within a range known to have a problem with suddenly going offline. The firmware update has to be applied through HP’s support tools, which are not supported on the ML 110/115. After much research and seeking help from HP, I was told that, in essence, I had been hung out to dry.
  • The ML 110/115 does not support the ProLiant Support Pack, nor does that model support Insight Control Manager. Keeping drivers updated and staying abreast of the various components’ health were virtually impossible.
  • There was also no HP ILO CLI available, which made remote tasks like firmware updates especially difficult.
  • The on-board storage controller had poor support from Nvidia, and offered only slim storage management features and little reporting on hard drive health.

For years I hit the management ceiling with that box, which probably cost my client more in my time and theirs than a more robust server at twice the hardware cost would have. And then what I had been dreading for years finally happened…

Two Months Ago

“Did you reboot the server?” That’s never a question you want to hear, especially when you did not reboot a server. I VPN’d into that office’s network and checked for the presence of the server on the network. Yes, the server was down. One power cycle later, the OS loaded just fine.

I checked the event logs, and it turned out there had been a massive flurry of parity errors that came out of nowhere. The server froze as a result. The controller was apparently dying. After a reboot, the data appeared fine, and there were no more parity errors coming from the Nvidia storage driver. I knew something had to be done, but being remote and working with an office that has a shoestring budget (and can often only afford used shoestrings) made the options few and unattractive.

What’s worse, as I started investigating things further, I noticed that the ILO Advanced card that was in the server was no longer showing up on the network. Aaaaand the BIOS clock would reset to July 2009 after being shut down (BIOS battery dying), causing strange problems with Active Directory and other applications running on the network that relied on accurate time (read: everything). AAAaaaaaand the two mirror sets (one for the system volume and one for the email server’s databases) had split apart and could not be re-synced because the Nvidia storage management software no longer recognized that any hard drives were connected.

The options, as I saw them, were for the business to either buy a new RAID controller, BIOS battery, and perhaps ILO card (and then scramble to perform the complex surgery remotely on their own, or pay a local consultant to coordinate with me, or pay to ship me on site) or get a new server altogether (and pay a local consultant to coordinate with me, or… you get the idea). Either way, it started to look more and more like a total forklift migration was necessary.

Two Months Later

Yes, it’s been about two months and the server is still riding in the same perilous state. Split mirrors, bi-monthly freezes that require a power cycle to recover from, and a lot of hoping and praying that data is not corrupted. Welcome to the world of supporting small business IT where people re-use tea bags and don’t run heat or AC in order to save money and keep the business open.

That Saturday night, it was getting late and I was thinking about bed. I checked my email one last time for anything pressing when I saw an MX Toolbox alert. This is never good. I scanned the email, saw what host was causing the alert, and knew that I was dead in the water. I could get into that client’s network via both a SonicWall VPN and unattended TeamViewer installations that existed on most of the workstation PCs. However, it was all futile because I didn’t have hardware-level access to the server as a result of the ILO’s failure. The office has a Lantronix Spider KVMoIP device, but it was being used for a workstation migration for one employee and was therefore not hooked up to the main office server. That was two layers of out of band management that were doing no good for the most important technology asset in the building.

All of this meant that someone would have to show up at the office to power cycle the server. The technical debt and compound interest of failure had already mounted fairly high by that point, considering the state of the server. However, things were about to get comical.

I’ll Gladly Pay You Tomorrow for Out of Band Management Today

What happened in the next 24 hours was a morbid comedy of oversights and compounded problems that ended in a whiplash-inducing facepalm.

First, I needed to email three people who would most likely be in the vicinity of that office so I could coordinate with one of them to drop by on their Sunday morning and power cycle the server. Except the server is what does email for the organization, so I can’t send to their organization email addresses (this is a Microsoft SBS machine). I only know one employee’s non-work address, and I also happen to know the Gmail address of another employee’s son.

I email those two people and tell them of the situation. As it turns out, two key workers are out, traveling to a convention in Texas. That makes access to email even more vital than normal. Everyone knows the situation and there’s not much more I can do, so I get to bed. It’s not until about 2 PM on Sunday, Mother’s Day here in the USA, that I hear back from one worker who has just enough time to stop by the office and power cycle the server.

I, meanwhile, was in the midst of a Mother’s Day dinner with my own family and had ditched my phone… just moments before the employee called me from the remote office. I missed the call, and the employee left a voicemail expressing confusion over which server to power cycle. The organization is small and only has two servers. One is the SBS machine and the other is an HP MicroServer that is used as a network monitoring station and catchall for various extraneous services. I had assumed that, over the years, everyone had come to know each server’s role by sight, so I had simply asked him to power cycle the SBS server, expecting that he would know which piece of hardware that was. The fellow power cycled both servers since he couldn’t get in touch with me directly.

Okay, no big deal. The MicroServer is just running CentOS and OpenNMS. They’re resilient and can handle a sudden shutdown. As I listened to that voicemail, I checked to see if I could remotely connect to the server that had been down all night. I couldn’t. Great. Time to call the office, talk to the person who was on site, and see what else could be done. Except the voicemail had been left over an hour earlier, and the employee had naturally left shortly after power cycling the server. I called his cell phone back, but he didn’t pick up. I left a voicemail.

A little later that Sunday I got in touch with another employee who lives closer to the office. He was on his way out to pick up Mother’s Day dinner for his wife and could swing by to check out the server. First, I had him power cycle it again. Maybe the first guy had just tapped the power button and not held it in? I held out hope for such a simple explanation. However, after I instructed this second person on how to make sure the server had shut down and then powered back up, I waited for the duration of a standard bootup, but nothing came up. It became apparent that the server was not coming back online.

“Do you know where the Spider is?” I asked hopefully. “No, I dunno where the other guy put it.” Gah! The Spider is a well known piece of equipment in that office, and it’s very rare that it can’t be found. I was about to concede defeat for that Sunday when, after some searching, the employee found the Spider. A few minutes of scrambling around and he had the thing hooked up to the server. Except… now I couldn’t get to the Spider. The fellow had to leave to pick up dinner and I wasn’t about to ruin his family’s Mother’s Day so I told him I’d see what I could do remotely, expecting nothing to be successful.

In the process of hooking up the Lantronix Spider, the employee had pulled the network cable out of the server and plugged it into the Spider. Then, from the Spider’s cascade port (it’s essentially a one-port switch), he had connected a patch cable to the server’s LAN port. That made me wonder… perhaps it was a port on the ProCurve switch that was bad? That would explain both the server and now the Lantronix Spider being inaccessible. Or maybe the port spontaneously shut down as a result of some bug. Crazier things have happened.

I browsed to the switch’s management interface. “Please enter your username and password!” Okay, no problem! “Wait… I can’t remember what the password is… NOOOOOO!” The organization uses KeePass to store important passwords and software keys. The KeePass file is on the server. The server that is down.

But wait! I have a copy of the KeePass databases on my own storage. Once a month or so I copy the files to my local storage so that I have an in-sync copy just in case. Whew! I found the switch’s login credentials and began inspecting things. I looked, hoping for some bad news concerning the switch’s health (at least that would mean the server was okay), but the switch looked perfect. Nothing was amiss.

I’ve always been told to troubleshoot network problems from the lowest layer first. I had pretty much ruled out the physical layer. Layer 2 seemed healthy. Not much can go wrong on a small, single-subnet LAN. Layer 3, IP… IP addresses… I gritted my teeth. I knew what the problem was. The Lantronix Spider is set to pick up an address via DHCP. Specifically, it’s a DHCP reservation on the network’s DHCP server. The server that’s down. I wanted the network-layer benefits of a static IP address, but I also wanted the Spider to be easily portable between networks. My original idea was that the Spider could be used to support PCs on other LANs, such as for workers based in home offices who didn’t come into the organization’s building very often. With the Spider getting an IP address via DHCP, I could just tell someone to take it home with them, and I’d only be left with walking them through configuring port forwarding, or getting TeamViewer set up on a PC on their LAN so I could get in and access the Spider via a local web browser. Except now the Spider was barking out forlorn DHCP discover packets and not getting any response back.
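For the curious, a reservation like that, written here in ISC dhcpd syntax rather than in the Windows DHCP console the SBS box actually uses, looks something like the snippet below. The MAC address and IP are made up for illustration.

    host lantronix-spider {
        # hypothetical MAC of the Spider and a hypothetical reserved address
        hardware ethernet 00:80:a3:aa:bb:cc;
        fixed-address 192.168.1.50;
    }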

I fired up Network Monitor on an office PC to be sure. Yep, there it was. A DHCP discover request broadcasting every sixty seconds or so. Okay, I can handle this. The small office has a SonicWall firewall that has DHCP services on it. I only need to enable them, check its list of leases to find what IP address it was given, and I’ll be good! I mosey my web browser on over to the firewall’s administrative page. I stare at it. It wants the password for the admin user. “Password… password… I had to change it a few weeks ago. What did I choose…”
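Had I thought of it at the time, the same check could have been done from the MicroServer’s shell with tcpdump, since DHCP traffic is just broadcasts on UDP ports 67 and 68 (the interface name below is an assumption):

    # watch for DHCP discovers and offers on the LAN-facing interface
    tcpdump -ni eth0 -v 'udp and (port 67 or port 68)'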

Oh well, I’ll look in the organization’s copied password file that I keep on my local storage! Yay foresight!! I found the firewall admin password and entered it. “Password Failure. Please Retry.” What?! Then I remembered that I had changed the firewall password due to security policy about two weeks ago. However, I hadn’t copied the organization’s password file to my local storage in a month. I had the old password in my copy of the password file, but not the new one. The new one was on the server that was currently down. Backups are taken every few hours, but a restoration needs to be done on functioning hardware. Super.
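In hindsight, the monthly manual copy was the weak link. A nightly cron job on my own machine, pulling the KeePass database over the VPN, would have kept my copy current. A rough sketch, with the server name, share, path, and credentials file all stand-ins:

    # crontab entry: pull the organization's KeePass DB nightly at 3 AM
    0 3 * * * smbclient //sbs-server/Data -A /home/wesley/.smbcreds -c 'cd IT; get passwords.kdbx /home/wesley/backups/passwords.kdbx'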

So there I was, locked out again: I couldn’t log in to the interface because I didn’t have the long password committed to memory. For super important passwords like that, I do keep a disaster recovery hard copy around. It’s essentially a few pages spelling out the most important usernames and passwords for the organization. However, only two people have that physical copy of the information. While I could have called them up and had them read off the password to me, I wasn’t ready to do that.

Instead, I turned to the HP MicroServer running CentOS 6. I have OpenNMS installed on it and have plans to install some ticketing software and maybe SmokePing or M/Monit. Now, however, it was going to be an impromptu DHCP server. Fortunately, I could remember the password for the MicroServer! A quick ‘yum install dhcp’ later and… “Couldn’t resolve host” WHAT DEVILRY IS THIS?! But of course; DNS for the network is performed by the SBS server… which is down. After facepalming, I changed resolv.conf to point to OpenDNS and continued my march towards a functioning DHCP server on the network. After a few minutes I had dhcpd running, and it quickly handed out a lease to the Spider.
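For the record, the scramble boiled down to roughly the following on the CentOS 6 box (the subnet, range, and router values are stand-ins, not the office’s real addressing):

    # point the MicroServer at OpenDNS, since the SBS box normally handles DNS
    printf 'nameserver 208.67.222.222\nnameserver 208.67.220.220\n' > /etc/resolv.conf

    # the CentOS 6 package is named "dhcp"; the daemon it provides is dhcpd
    yum -y install dhcp

…plus a minimal /etc/dhcp/dhcpd.conf, just enough to hand the Spider a lease:

    subnet 192.168.1.0 netmask 255.255.255.0 {
        range 192.168.1.200 192.168.1.210;
        option routers 192.168.1.1;
        option domain-name-servers 208.67.222.222, 208.67.220.220;
    }

…and then a quick service dhcpd start. Once the SBS box is back up, the impromptu dhcpd should be stopped so the two DHCP servers don’t fight over leases.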

And it was then that I saw it. After logging into the Spider, I viewed the remote console and saw a Windows installation screen on the server. Suddenly, I remembered what had happened. In the process of preparing for a migration away from the failing hardware, I needed to experiment with making an unattended installation file, so I had asked a remote worker to put the SBS 2008 install CD in the main server’s tray. Of course, rebooting caused the server to boot from the CD drive, which sat higher in the boot order than the hard drives. I sat in horror, thinking about my cascade of failures. Nevertheless, that wasn’t the time to flail in self-loathing. I simply needed to hit “cancel,” back out of the installation welcome screen, and boot from the hard drive.

Except the Spider was unable to interact with the server as a remote keyboard or mouse. I’ve used the Spider on that very server in the past, and it worked great at all stages of the boot process. In the years that I’ve worked with that office I’ve had to check BIOS settings, ILO firmware settings, and storage controller settings, all using either the Spider or the ILO itself. But now, for some unexplained reason, the Spider was not able to input anything. I couldn’t move the mouse; I couldn’t press keys. So I sat and stared at the remote video in complete disbelief.

It was a simple matter of leaving a voicemail for someone and telling them to remove the disc from the DVD drive the next time they were in the office. The next morning, the worker I had left a message for did just that and power cycled the server, and it booted up as normal. Life continued.

I was abashed.

More about my conclusions concerning the situation later. In the meantime, got a similar story to share? Let me know in the comments below or contact me and you can write a guest blog post about it.


  1. Andrew

    May 17, 2013 at 6:41 am

    Wow, that’s quite a lot of fail! Thanks for sharing.

    I’d love to read a follow up postmortem of lessons learnt, workarounds and solutions implemented.


    • Wesley David

      May 17, 2013 at 10:51 am

      Yes, I’m considering writing a post like that. I’m still in the midst of recovering this office, but I can still see a lot of room for improvement and things I could have done to prevent most of the failures.


  2. rcxb

    May 17, 2013 at 12:01 pm

    Yeah, continuing to depend on that server which lacked any out-of-band management was a big all-my-eggs-in-one-basket mistake that bit harder this time than most…

    One of my favorite budget-minded network admin tricks is a $40 WiFi AP/router that can run DD-WRT or other Linux firmware. A low-spec but fully-functional Linux box with two network interfaces, and a USB port that can connect to a large number of usb-serial dongles (which can be connected to your server’s serial port, as well as any network equipment lacking ipmi). It’s incredibly useful, even just to have as a ping target you can move around the network when setting things up, or tracking down problems. The fact that it can also act as your NAS, print server, muzak player, etc, is just fringe benefit, on a $40 device.


    • Wesley David

      May 17, 2013 at 12:06 pm

      That’s a good idea on how to make a serial -> Ethernet gateway, in essence.


  3. […] regard to my post “Cascading Failure, Technical Debt, and Punching a House with my Face“, I was asked about my conclusions and how I dug myself out of that […]

