I like big uptime numbers. I was unnaturally fascinated by the Cool Solutions sponsored “NetWare Server Uptime Contest”. Over six years without a reboot? Be still my heart.

uptime.FAIL

Recently, this obsession of mine with seeing a server’s uptime measured in years and not days was challenged. It all happened when I realized I needed to go through the available updates for a Windows SBS 2008 machine that I am responsible for. It was something that I had been putting off for too long. While attempting to conceal my rapturous joy at having to perform that task (and succeeding quite well), I realized I was not looking forward to the reboot(s) that were imminent.

I hate rebooting servers. Nothing good ever comes of it. Maybe that’s a bit pessimistic, but the fear I think is justified. So many complex services starting up simultaneously, some of them having been recently patched and most of them ordered in a hierarchical dependency chain, puts an unnecessary strain on my hairline.

Thinking about it some more, I realized that while my love of uptime was partially driven by a childish fascination with extremes, it was more a reaction to my fear of rebooting. Upon even further introspection, I saw that my fear of rebooting wasn’t so much a fear of rebooting as it was a fear of problems. SysAdmins hate problems.

However, in reality the reaction to my fears was causing my fears to actualize! Think deeply on that. There are greater applications of that thought than to the simple act of rebooting servers. Selah.

I turned to my Twitter companions for a quick opinion poll (I’m quickly recognizing Twitter as a great resource if used properly). I asked my followers if epic uptime was worthy of high-fives or if it was a sign of an unpatched server. The responses were unanimous:

Michael “@errr_” Rice: @Nonapeptide Epic Uptime == unpatched server . #sysadmin #epenis

Benjamin “@blueben” Krueger: @Nonapeptide It doesn’t take much more than dumb luck to keep a quiet back-room server up for a long time. We shouldn’t reward bad behavior.

Jonathan “@j_angliss” Angliss: @Nonapeptide depends on the platform. Usually means unpatched kernels etc, but there is ksplice now which aids it.

Jonathan “@j_angliss” Angliss: @Nonapeptide that being said, lack of regular reboots leads to other issues. #lopsa tech mailing list discussed this http://bit.ly/aTJlnS

Jonathan “@j_angliss” Angliss: @Nonapeptide generally speaking, reboots usually only apply for kernel type stuffs, windows is worse due to dll hell, and running services

Jason “@obfuscurity” Dixon: @Nonapeptide unpatched server == stupid + irresponsible + lazy

@jtimberman: @Nonapeptide Unpatched server. Service availability doesn’t require a single system to be up for ages. #SysAdmin

@dancarley: Sign of unmaintained machines + insufficient infra. Should always know what state a machine will be in after reboot.

Wow, it looks like I showed up to this thought party unfashionably late. My obsession with uptime, spawned from a mild fascination with big numbers and a major allergic reaction to problems, apparently needed to die.

What was most important for me to realize was that a schedule of controlled reboots will increase system stability and decrease the likelihood of a server not coming back up. It makes sense in retrospect, but I suppose I was frequently on the treadmill of reactionary administering and hadn’t paused to assess my assumptions.

And yet, I still like to have a quantifiable measure of success. To me, uptime meant success. If something had been running for three years, it meant that there were no problems with it (false reasoning, I know). So I posed another question to the Twitterverse. If server uptime is a poor metric, what do I use to measure a thing’s success?

@jtimberman: @Nonapeptide “Availability”. The infamous number of nines. #SysAdmin

@mibus: @Nonapeptide Service Uptime, not Server Uptime. Load-balance, cluster, whatever – users care about the Service, not the Server.

Jason “@obfuscurity” Dixon: @Nonapeptide It’s more than just service health or uptime. Don’t take the business effects for granted. Is the service doing it’s *JOB*?

Jason “@obfuscurity” Dixon: @Nonapeptide But seriously, enough of this “uptime” nonsense. I’ve said it before, there has to be PURPOSE to your monitoring. Correlate!

It makes perfect sense. Users don’t care if a server has been up for three years. If the thing is slow, has SMB authentication issues or is otherwise unhelpful, then the thing is a failure regardless of how infrequently it locks or otherwise requires a reboot. Success should be measured in a way that is abstracted away from the base hardware and OS that the service is running on.

Has the DFS cluster been able to service user requests at all times for the last three years even through monthly patching, reboots, OS upgrades and network infrastructure changes? It has? Wow. That, my friends, is success.
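To put a number on that kind of success, here is a minimal Python sketch of how the "number of nines" could be worked out from a log of outages. The function names and the outage format are my own assumptions for illustration, not something from the tweets above, and the outage windows are assumed not to overlap.

```python
from datetime import datetime, timedelta


def availability(period_start, period_end, outages):
    """Fraction of the period during which the service answered requests.

    `outages` is a list of (start, end) datetime pairs. They are assumed
    not to overlap; each one is clipped to the reporting period.
    """
    total = (period_end - period_start).total_seconds()
    down = sum(
        (min(end, period_end) - max(start, period_start)).total_seconds()
        for start, end in outages
        if end > period_start and start < period_end
    )
    return (total - down) / total


def downtime_budget(nines, period=timedelta(days=365)):
    """Downtime allowed over `period` for a given number of nines.

    Three nines means 99.9% availability, so 0.1% of the period may be lost.
    """
    return period * (10 ** -nines)


# Example: a 30-day month with a single outage of 43 minutes 12 seconds.
start = datetime(2010, 4, 1)
end = start + timedelta(days=30)
outage = (datetime(2010, 4, 10, 2, 0), datetime(2010, 4, 10, 2, 43, 12))
print(f"{availability(start, end, [outage]):.4%}")
print(downtime_budget(3))
```

Notice that nothing in the calculation cares how many times any single server was rebooted. Only the time users could not reach the service counts.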

When you view success from service availability rather than individual systems’ uptime, you begin to realize that a service is dependent upon more than just its binaries or a single switch in a stack or whatever it is that you happen to be monitoring. Any service is reliant on its OS, which is reliant on its hardware, which is reliant on the network, which is reliant on the power, which is… you get the idea.

With that understanding of service availability, you can easily see what are the most important parts of your infrastructure and what ought to be monitored and how.
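As a rough illustration of that dependency chain, the sketch below multiplies the availability of each layer to estimate what the service on top can achieve. It assumes a strictly serial chain with independent failures, which real infrastructure rarely is, and every number in it is made up for the example.

```python
from math import prod

# Hypothetical availability of each layer; illustrative numbers only.
layers = {
    "service": 0.9995,
    "os": 0.9999,
    "hardware": 0.9990,
    "network": 0.9995,
    "power": 0.9999,
}


def chain_availability(layers):
    """Availability of a serial chain, assuming independent failures."""
    return prod(layers.values())


def weakest_link(layers):
    """Name of the layer with the lowest availability."""
    return min(layers, key=layers.get)


print(f"whole chain: {chain_availability(layers):.4%}")
print(f"weakest layer: {weakest_link(layers)}")
```

Under those assumptions the chain is always a little worse than its worst layer, so the weakest layer is a sensible first place to add monitoring or redundancy.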

To finish up, I would recommend that everyone go and read the LOPSA thread that j_angliss referenced: http://lopsa.org/pipermail/tech/2010-April/thread.html#4324

If you read and think on the whole thing you’ve just earned a Bachelors of Awesomeness in Systems Administration. Here is just one of the great thoughts:

“And who knows what config changes have been made that will cause the machine or some service to fail to come up in the event of an unexpected/unattended reboot. I am seriously considering adding a nagios to check for every machine in our environment to issue a warning when a machine has been up for more than 3 months. If it hasn’t been rebooted in 3 months it seems less likely to come up properly or be up to date on patches.” – Tracy Reed

“We came to the same conclusion at $WORK, it also helped highlight machines that were either single points of failure or that people were just flat out scared about.” – Dean Wilson

Wow, Dean’s realization is true. Servers or appliances that haven’t been rebooted since men wore cummerbunds and women swooned are probably a sign of greater problems than unpatched services.
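For the curious, the check Tracy describes could look something like the Python sketch below. This is my own guess at it and not anything posted in the thread: it assumes a Linux host where /proc/uptime exists, treats "3 months" as 90 days, and uses the standard Nagios plugin exit codes (0 for OK, 1 for WARNING, 3 for UNKNOWN). The function names are hypothetical.

```python
OK, WARNING, UNKNOWN = 0, 1, 3  # standard Nagios plugin exit codes


def read_uptime_seconds(path="/proc/uptime"):
    """Return uptime in seconds (first field of /proc/uptime, Linux only)."""
    with open(path) as f:
        return float(f.read().split()[0])


def check_uptime(uptime_seconds, warn_days=90):
    """Return (exit_code, message); warn once uptime exceeds warn_days.

    The 90-day default is an assumption standing in for "3 months".
    """
    days = uptime_seconds / 86400
    if days > warn_days:
        return WARNING, f"WARNING - up {days:.0f} days, over the {warn_days}-day limit"
    return OK, f"OK - up {days:.0f} days"


def main():
    """Print the one-line status Nagios expects and return the exit code."""
    try:
        code, message = check_uptime(read_uptime_seconds())
    except (OSError, ValueError, IndexError) as exc:
        code, message = UNKNOWN, f"UNKNOWN - could not read uptime: {exc}"
    print(message)
    return code
```

To run it as a plugin, the script would finish by passing the return value of main() to sys.exit(). A Windows box like my SBS machine would need a different way of reading the uptime, but the threshold logic stays the same.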

However, the idea of uptime being bad is not without intelligent opposition. For example, take this quote from the LOPSA thread:

“Having a high uptime does not necessarily mean that there have been no security updates, since you can update almost everything without a reboot.

Granted a reboot is required to update the kernel itself, but if your server is decently hardened and firewalled, exactly which kernel exploits are you vulnerable to?

I had a server that was online for over 1300 days, until it was rebooted by datacenter power issues. Since it rebooted anyway, I took the opportunity to install the only package that was not up to current, the linux-kernel. Did I suddenly feel safer? Not really :)” – Charles R Jones

It should be noted that even that argument had its share of detractors. Just read the thread a few times and come to your own conclusions. And when you do, post your thoughts here. This topic isn’t as easy as a “Yes it is!” or “No it isn’t!” decision.

So that’s it! You’ve just been witness to the death of a phobia and the birth of a much healthier and more logical outlook for this SysAdmin. I’m getting less nubby each day… thanks to the Twitterverse and some people smarter than me willing to share their experience.

How about you? Do you reboot servers on a schedule or do you dread reboots like papercuts on your eyeballs? Do you measure success with system uptime or with service uptime? As a bonus question, one that I wish I had more time to delve into, do modern-day Windows systems need more reboots for more patches than *nix machines?

Gotta go reboot that server now…