Epic Uptime – Bragging Rights or Epic Fail?

I like big uptime numbers. I was unnaturally fascinated by the Cool Solutions-sponsored “NetWare Server Uptime Contest”. Over six years without a reboot? Be still my heart.


Recently, this obsession of mine with seeing a server’s uptime measured in years rather than days was challenged. It happened when I realized I needed to peruse the available updates for a Windows SBS 2008 machine that I am responsible for, something I had been putting off for far too long. While attempting to conceal my rapturous joy at the prospect (and succeeding quite well), I realized I was not looking forward to the reboot(s) that were imminent.

I hate rebooting servers. Nothing good ever comes of it. Maybe that’s a bit pessimistic, but I think the fear is justified. Having so many complex services start up simultaneously, some of them recently patched and most of them bound into a hierarchical dependency chain, puts an unnecessary strain on my hairline.

Thinking about it some more, I realized that while my love of uptime was partially driven by a childish fascination with extremes, it was more a reaction to my fear of rebooting. Upon even further introspection, I saw that my fear of rebooting wasn’t so much a fear of rebooting as it was a fear of problems. SysAdmins hate problems.

However, in reality my reaction to that fear was causing the fear to come true: the longer I avoided rebooting, the riskier each eventual reboot became. Think deeply on that. There are greater applications of that thought than the simple act of rebooting servers. Selah.

I turned to my Twitter companions for a quick opinion poll (I’m quickly coming to recognize Twitter as a great resource if used properly). I asked my followers whether epic uptime was worthy of high-fives or a sign of an unpatched server. The responses were unanimous:

Michael “@errr_” Rice: @Nonapeptide Epic Uptime == unpatched server . #sysadmin #epenis

Benjamin “@blueben” Krueger: @Nonapeptide It doesn’t take much more than dumb luck to keep a quiet back-room server up for a long time. We shouldn’t reward bad behavior.

Jonathan “@j_angliss” Angliss: @Nonapeptide depends on the platform. Usually means unpatched kernels etc, but there is ksplice now which aids it.

Jonathan “@j_angliss” Angliss: @Nonapeptide that being said, lack of regular reboots leads to other issues. #lopsa tech mailing list discussed this http://bit.ly/aTJlnS

Jonathan “@j_angliss” Angliss: @Nonapeptide generally speaking, reboots usually only apply for kernel type stuffs, windows is worse due to dll hell, and running services

Jason “@obfuscurity” Dixon: @Nonapeptide unpatched server == stupid + irresponsible + lazy

@jtimberman: @Nonapeptide Unpatched server. Service availability doesn’t require a single system to be up for ages. #SysAdmin

@dancarley: Sign of unmaintained machines + insufficient infra. Should always know what state a machine will be in after reboot.

Wow, it looks like I showed up to this thought party unfashionably late. My obsession with uptime, spawned from a mild fascination with big numbers and a major allergic reaction to problems, apparently needed to die.

What was most important for me to realize was that a schedule of controlled reboots increases system stability and decreases the likelihood of a server not coming back up. It makes sense in retrospect, but I suppose I was so often on the treadmill of reactive administration that I hadn’t paused to question my assumptions.

And yet, I still like to have a quantifiable measure of success. To me, uptime meant success. If something had been running for three years, it meant there were no problems with it (false reasoning, I know). So I posed another question to the Twitterverse: if server uptime is a poor metric, what do I use to measure a thing’s success?

@jtimberman: @Nonapeptide “Availability”. The infamous number of nines. #SysAdmin

@mibus: @Nonapeptide Service Uptime, not Server Uptime. Load-balance, cluster, whatever – users care about the Service, not the Server.

Jason “@obfuscurity” Dixon: @Nonapeptide It’s more than just service health or uptime. Don’t take the business effects for granted. Is the service doing it’s *JOB*?

Jason “@obfuscurity” Dixon: @Nonapeptide But seriously, enough of this “uptime” nonsense. I’ve said it before, there has to be PURPOSE to your monitoring. Correlate!

It makes perfect sense. Users don’t care if a server has been up for three years. If the thing is slow, has SMB authentication issues, or is otherwise unhelpful, then it is a failure regardless of how infrequently it locks up or otherwise requires a reboot. Success should be measured in a way that is abstracted away from the base hardware and OS that the service runs on.

Has the DFS cluster been able to service user requests at all times for the last three years even through monthly patching, reboots, OS upgrades and network infrastructure changes? It has? Wow. That, my friends, is success.
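
Out of curiosity, I worked out what that “number of nines” looks like in cold figures. This is just a back-of-the-envelope sketch in Python; the 52-minute outage figure is invented purely for illustration:

    # Availability as a percentage of a period, based on measured downtime.
    # The outage figure used below is hypothetical.

    def availability(outage_minutes, period_days=365):
        """Return availability as a percentage over the given period."""
        total_minutes = period_days * 24 * 60
        return 100.0 * (total_minutes - outage_minutes) / total_minutes

    # Hypothetical: the DFS service was unreachable for 52 minutes this year.
    print("%.4f%%" % availability(52))  # 99.9901%, just inside "four nines"

    # Downtime budget that each level of "nines" allows per year:
    for nines in range(2, 6):
        budget_minutes = (10 ** -nines) * 365 * 24 * 60
        print("%d nines allows %.1f minutes of downtime per year" % (nines, budget_minutes))

Measured that way, a box that reboots cleanly every month can still post better numbers than the proud three-year server that falls over once and takes a day to resurrect.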

When you view success in terms of service availability rather than an individual system’s uptime, you begin to realize that a service depends on more than just its binaries, or a single switch in a stack, or whatever it is you happen to be monitoring. Any service is reliant on its OS, which is reliant on its hardware, which is reliant on the network, which is reliant on the power, which is… you get the idea.

With that understanding of service availability, you can more easily see which parts of your infrastructure matter most, what ought to be monitored, and how.
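
To make that dependency chain concrete, here’s a toy sketch in Python. The component names and health states are made up; a real monitoring setup would obviously model far more than this:

    # A toy model of the dependency chain: a component is only "available"
    # if it is healthy AND everything it depends on is available too.
    # All names and statuses below are invented for illustration.

    DEPENDS_ON = {
        "dfs-service": ["file-server-os"],
        "file-server-os": ["server-hardware"],
        "server-hardware": ["core-switch", "ups-power"],
        "core-switch": ["ups-power"],
        "ups-power": [],
    }

    HEALTHY = {
        "dfs-service": True,       # the service itself looks fine...
        "file-server-os": True,
        "server-hardware": True,
        "core-switch": True,
        "ups-power": False,        # ...but the power underneath it is not
    }

    def available(component):
        """True only if the component and its whole dependency chain are healthy."""
        return HEALTHY.get(component, False) and all(
            available(dep) for dep in DEPENDS_ON.get(component, [])
        )

    print(available("dfs-service"))  # False: the UPS drags the whole service down

Walking the chain like this is also a decent way to spot which low-level components sit underneath the most services; those are the ones that deserve the closest monitoring.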

To finish up, I recommend that everyone go read the LOPSA thread that @j_angliss referenced: http://lopsa.org/pipermail/tech/2010-April/thread.html#4324

If you read and think on the whole thing, you’ve just earned a Bachelor’s of Awesomeness in Systems Administration. Here are a couple of the great thoughts:

“And who knows what config changes have been made that will cause the machine or some service to fail to come up in the event of an unexpected/unattended reboot. I am seriously considering adding a nagios check for every machine in our environment to issue a warning when a machine has been up for more than 3 months. If it hasn’t been rebooted in 3 months it seems less likely to come up properly or be up to date on patches.” – Tracy Reed

“We came to the same conclusion at $WORK, it also helped highlight machines that were either single points of failure or that people were just flat out scared about.” – Dean Wilson

Wow, Dean’s realization rings true. Servers or appliances that haven’t been rebooted since men wore cummerbunds and women swooned are probably a sign of greater problems than just unpatched services.
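
Tracy’s Nagios idea is simple enough to sketch. Here’s a rough, hypothetical take in Python for a Linux box; the 90-day threshold, the messages and the overall shape are my own assumptions, not anything from the thread:

    #!/usr/bin/env python3
    # Rough sketch of an "uptime too high" Nagios-style check for Linux.
    # The 90-day threshold and the message wording are assumptions.

    import sys

    WARN_DAYS = 90  # roughly Tracy's "more than 3 months"

    def uptime_days():
        """Read the machine's uptime in days from /proc/uptime (Linux only)."""
        with open("/proc/uptime") as f:
            uptime_seconds = float(f.read().split()[0])
        return uptime_seconds / 86400

    def main():
        days = uptime_days()
        if days > WARN_DAYS:
            print("WARNING - up %.0f days; overdue for a controlled reboot" % days)
            return 1  # Nagios exit code for WARNING
        print("OK - up %.0f days" % days)
        return 0      # Nagios exit code for OK

    if __name__ == "__main__":
        sys.exit(main())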

However, the idea that epic uptime is a bad thing is not without intelligent opposition. For example, take this quote from the LOPSA thread:

“Having a high uptime does not necessarily mean that there have been no security updates, since you can update almost everything without a reboot.

Granted a reboot is required to update the kernel itself, but if your server is decently hardened and firewalled, exactly which kernel exploits are you vulnerable to?

I had a server that was online for over 1300 days, until it was rebooted by datacenter power issues. Since it rebooted anyway, I took the opportunity to install the only package that was not up to current, the linux-kernel. Did I suddenly feel safer? Not really :)” – Charles R Jones

It should be noted that even that argument had its share of detractors. Just read the thread a few times and come to your own conclusions. And when you do, post your thoughts here. This topic isn’t as simple as a “Yes it is!” or “No it isn’t!” decision.

So that’s it! You’ve just witnessed the death of a phobia and the birth of a much healthier and more logical outlook for this SysAdmin. I’m getting less nubby each day… thanks to the Twitterverse and some people smarter than me who were willing to share their experience.

How about you? Do you reboot servers on a schedule, or do you dread reboots like papercuts on your eyeballs? Do you measure success by system uptime or by service uptime? As a bonus question, one I wish I had more time to delve into: do modern Windows systems need more reboots for more patches than *nix machines?

Gotta go reboot that server now…


  1. Rob

    June 16, 2010 at 10:25 am

    I guess there’s an overriding theme there: if you’re scared to reboot something, you should go and do it right away.

    Either it’ll come back fine, and you don’t need to be scared any more, or it won’t, and you can sort the problems now, rather than at the same time as having to sort the other hundred servers in the datacentre that just had a power failure.

    As a few pointed out, *NIX systems tend to require fewer reboots to apply security patches than Windows, where just about every month’s updates seem to involve “critical system files”. I’ve got a Linux box sitting at just over four years’ uptime, but all the externally facing stuff is up to date, and I’m actually pretty sure it’d come back nicely if I did reboot it.

    A downside of rebooting often is that it may actually mask problems – that service that sits there gently leaking memory, for example. When we first implemented SVN here, we had to restart Apache every few weeks to work around a bug in the WebDAV handler. Load on a few more users, and that turned into every few hours.

    It shouldn’t be a *requirement* to reboot in order to keep things stable, so if you demonstrate a stable system after a year’s uptime, it’s probably a good indication that the software running on it is in good order.


    • Wesley.Nonapeptide

      June 16, 2010 at 10:32 am

      Very good point about rebooting possibly masking problems! That might be an interesting topic to pursue at a later date. I tend to think that a monthly reboot is a good thing. Anything that would crop up after several months or so of uptime might not be worth worrying about.

      Famous last words, I know.


  2. voretaq7

    June 16, 2010 at 10:30 am

    I’m definitely in the “Regular Reboots” camp, and have been since my first “Real Job” in the IT field. While there’s a lot to be said for a machine that has been up for 6 years (like your NetWare screenshot), most of it is bad.

    In FreeBSD-land we’ve had several kernel-land locally exploitable privilege-escalation issues in the last 8 years (to say nothing of the panic()-inducing DoS-type bugs that have been fixed). Not making sure your systems are up to date with those patches (which can’t be applied without rebooting) is pretty much gross negligence in my book.

    All that being said, I feel any machine with an uptime over one year should only be rebooted with the most extreme caution: as mentioned a few times above, you never know what config changes have been made that will bite you on reboot. (My favorite example was a machine that got renumbered: the admin changed the running IP configuration via ifconfig, but not the startup configuration files. On reboot the machine reverted to its old IPs. Hilarity (insanity) ensued.)


    • Wesley.Nonapeptide

      June 16, 2010 at 10:35 am

      Yes, I’ve been nearly bitten by delayed reboots. Actually, it was more like “I made a change that needed a reboot, but forgot to do it later that night. I remember it three weeks later. What was that change I made again?”

      Yes, I usually have a change log page for each server in my wiki, but once in a while I get lazy.


  3. tk

    June 17, 2010 at 11:09 am

    Due to a bug somewhere between VMware ESX 3.5 and the OpenSolaris kernel, I had a VM that would boot and show over 8000 days of uptime.

    tcsh-[110]% uptime
    8:18am up 8125 day(s), 12:19, 2 users, load average: 0.00, 0.00, 0.00

    Fully patched, and current, too. :)


  4. […] of them… June 18th, 2010: The Nubby Admin has a great post on uptimes, and the old fascination of having a large uptime. Okay, you can get your minds out the […]


  5. Twirrim

    June 19, 2010 at 2:22 am

    Service uptime, never server uptime.

    Colour me paranoid, but even with a server that is ostensibly locked away from the world, I don’t trust it to be safe. If there is an exploit available for the box, I want it patched and safe. You should never, ever rely on other layers of security just to avoid a kernel patch and reboot.
    Several years ago I was working for a large ISP / hosting company. For one of our main web clusters the servers were locked away from the world, nicely hidden behind firewalls and load balancers. Taking advantage of a 0-day exploit, a hacker got in through an XSS attack, escalated up to root and started trashing the system. Luckily, quirks of the platform squashed them and their actions within about 5 minutes. Enough time to do some damage, but not irreparably so. If there had been another box they could have accessed from there, they could possibly have exploited that too, and so on. It just doesn’t pay to run unpatched machines.

    If you can’t reboot a server at any time without disrupting service, you’ve got work to do to mitigate that problem. It shouldn’t even be necessary to post a maintenance window notice (though you always should). Ideally no one should ever notice that a server has been or is being rebooted.


    • Wesley.Nonapeptide

      June 19, 2010 at 8:25 am

      “Service uptime, never server uptime,” is very well put.

      I’m always amazed that major institutions like banks and Microsoft have so much “planned downtime” for certain services. Why should my bank’s website ever be down? They make billions of dollars. Can’t they afford the systems necessary to keep services available through patching, or is there more complexity involved that I’m not aware of? Hmmm.


  6. Graycat

    September 7, 2010 at 2:11 am

    As of today my longest-running Windows box is at 121 days of uptime. Why? Well, it’s an archive server that got missed in the last patching round, so I’ll be doing that today.

    Generally though our Windows servers are patched on a monthly / every other month schedule so anything from 30 to 60 days between restarts is about right for us.

    As someone has said, it shouldn’t be the total amount of uptime you’re looking at but the amount of *unscheduled* downtime over a period (week / month / year / decade). I’m very much in this group and would happily reboot my Windows machines on a weekly basis if it helped with service uptime etc.

    Of course you’ve also got such fun things as *nix and BSD machines, or even switches, to think about. I’ve got switches that haven’t had a moment of unscheduled downtime for years but have been off twice in the last six months due to power maintenance in the building over a long weekend (no point running the UPS all weekend just for them).
    *nix machines are also interesting in that regard, as without exception all of mine simply sit there and get on with their job. After a major office migration I’ve yet to turn some of them off, even through their patching schedule.

    I suppose that’s something else to think about – the impact of patching on an infrastructure’s availability.


    • Wesley.Nonapeptide

      September 7, 2010 at 10:47 am

      Ah yes, the maintenance crew and their confounded power cuts. Last place I worked in, many of them had keycard access to the server room. Seeing reciprocating saws and tarps in the server room is not conducive to maintaining a nice, dark hair color.


  7. Arun

    April 2, 2013 at 8:54 am

    Oh come on, you almost sound jealous. Uptime is nothing but the time a system was up without a reboot; it doesn’t mean much beyond that. It’s a great achievement. Stop feeling inadequate and just take it as a system that was up for more than 10 years. I have heard of only NetWare and Linux systems doing that, and only the sysadmins of those systems shed a tear about having to turn off dying hardware.


  8. […] as impressive a metric as that might have been, there’s some disagreement as to whether it is an indicator of success: it’s usually enough to avoid actively […]

