The Wisdom of Specificity in Monitoring and Alerting

In my last post, I shared my experience with a web host suspending a hosting account, which not only took down the website but also took out DNS resolution for the domain. Without proper DNS resolution, systems that relied on it failed miserably. Many things failed; however, there was one warning that I never received: a “Website Down!” alert.

I should have received a “Website Down!” alert within minutes of the account being suspended. Why didn’t that happen? I thought about it for roughly 2 seconds before I figured it out (that was 1.5 seconds too long… but it was early in the morning, so that might explain it). My website monitor is Site24x7 from Zoho Corp. It is set to ping my server and report on latency as well as “uptime”. That uptime is based purely on ping response. However, ICMP sits much lower in the network stack than HTTP, so a ping reply only tells me that some host answered at the resolved IP address, not that my website is being served.

Before I go any further, I’d like to say that I’ve long known and understood the importance of checking your assumptions, and specifically your assumptions about what a ping response is really telling you. However, sometimes an assumption still slips through in spite of everything you know.

When my DNS records were suspended as part of the account suspension, my domain began resolving to a different IP address. I’m not sure why, but I believe that it is a default action of my web host’s DNS system to point a suspended account’s records at one of their general-purpose servers. As such, my ping test still resolved the hostname, got replies from that other server, and I never got a “Website Down!” alert.
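
One way to catch this kind of silent re-pointing is to alert when the hostname stops resolving to the address you expect. The following is only a minimal Python sketch under stated assumptions, not something Site24x7 provides: the function names, the hostname, and the IP address are all hypothetical placeholders, and the `resolver` parameter exists only so the logic can be exercised without touching real DNS.

```python
import socket


def resolved_addresses(hostname, resolver=socket.gethostbyname_ex):
    """Return the set of IPv4 addresses the hostname currently resolves to."""
    try:
        _name, _aliases, addresses = resolver(hostname)
    except OSError:
        # socket.gaierror is a subclass of OSError: resolution failed outright.
        return set()
    return set(addresses)


def dns_alert(hostname, expected_addresses, resolver=socket.gethostbyname_ex):
    """Return an alert message, or None when DNS looks as expected."""
    actual = resolved_addresses(hostname, resolver)
    if not actual:
        return f"{hostname} does not resolve at all"
    # Any address outside the expected set means the name has been re-pointed.
    unexpected = actual - set(expected_addresses)
    if unexpected:
        return f"{hostname} resolves to unexpected address(es): {sorted(unexpected)}"
    return None
```

Calling something like `dns_alert("www.example.com", {"203.0.113.10"})` from a scheduled job would return `None` while the name points where it should and a message otherwise. It deliberately treats "resolves somewhere else" as a failure, which is exactly the case a plain ping test lets through.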

Furthermore, merely checking for HTTP responses likely wouldn’t have helped, since Apache would have been running normally on the server my domain now pointed to. I specifically needed to check for the availability of a unique web page on my site, and not index.htm (I’m sure you can see why that wouldn’t be quite so helpful).
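
To make that concrete, here is a rough Python sketch of a content check. It assumes you publish a page containing a marker string that no other server would serve by accident; the function names, URL, and marker are hypothetical, and the `fetcher` parameter is there only so the decision logic can be tested without a network.

```python
import urllib.error
import urllib.request


def page_looks_right(status, body, marker):
    """True only if the response is a 200 and contains the marker text."""
    return status == 200 and marker in body


def fetch(url, timeout=10):
    """Return (status, body) for the URL; (None, "") if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status, body
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 404 or 500.
        return error.code, ""
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return None, ""


def website_alert(url, marker, fetcher=fetch):
    """Return an alert message, or None when the expected page is served."""
    status, body = fetcher(url)
    if page_looks_right(status, body, marker):
        return None
    return f"Website Down! {url} returned status {status} without the expected marker"
```

A healthy Apache on somebody else's server would most likely answer with a 404 or a default page here, and neither contains the marker, so the check fails where a ping or a bare "did I get an HTTP response" test would pass.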

All this to say: When setting up monitoring and alerting, make sure you question what your monitors are actually monitoring and what your alerts are actually saying. Do you have any similar stories about poor assumptions? Any alerting system failures you’d like to share? Don’t be shy… we all make mistakes. =)


  1. Matt Simmons

    April 30, 2010 at 10:13 am

    Oh man….me and assumptions…we go way back :-)

    DNS is one of those really twitchy things, too, because one small error has untold ramifications. A large problem like the one you experienced wreaks havoc.

    My biggest assumption when monitoring is always that if I don’t get alerts, then things are fine. Before I had two Nagios servers watching each other’s backs, if Nagios went down, how would I know? For that matter, even if you’re monitoring it, if your only network connection goes down, how do you find out?

    Yeah, I’ve got a loooong history of bad assumptions. I’d assume that I’m getting better, but that’s just part of the problem 😉


  2. Jason Dixon

    April 30, 2010 at 10:50 am

    The problem with “uptime monitoring” and monitoring in general is that we’ve been trained to focus on availability, rather than purpose. It’s not easy to make good monitors; it takes knowledge of the application, insight into the unexpected and a commitment to perfection (excuse the cheese).

    Anyhoo, I’ve written on this issue a couple times already.


  3. Wesley.Nonapeptide

    April 30, 2010 at 1:42 pm

    @Matt: I sympathize concerning the assumption that if I’m not getting alerts I’m okay. I have a backup program that is supposed to email me upon failure. It’s poorly implemented and sometimes that email doesn’t reach me. For those situations, I create recurring tasks in Outlook to check on various systems. It seems to defeat the purpose of automation if you have to manually check it… but such is life as long as the second law of thermodynamics is in effect.

    @Jason: Sorry your comment got moderated. WordPress captures comments that have more than one URL in them. I’ll ease up on that.

    Moving on, you’re exactly right about focusing on purpose over availability as well as needing to have insight into the unexpected. It was totally unexpected that my web host would switch the resolution of my domain to one of their other servers upon account suspension. I would have assumed that all resolution for my domain would have been suspended. Doesn’t that seem odd to anyone else?


  4. Doug Luxem

    April 30, 2010 at 2:22 pm

    Reiterating what Jason said – the key when monitoring is to look at it from the perspective of the service offered to end user, not the individual components. You could monitor the HTTP port, DNS queries, the PostgreSQL back-end; or instead, have a monitoring platform that can enter a username and password on your login page (or whatever) and make sure it gets a valid response back.

    The other benefit is that you can monitor and alert on response time of the total system whereas the ICMP round trip is not really useful.

    Regarding switching web servers via DNS – that sounds like typical hosting extortion :)


  5. Wesley.Nonapeptide

    April 30, 2010 at 2:33 pm

    @Doug: “the key when monitoring is to look at it from the perspective of the service offered to end user”

    Quoted for truth. What tools do you typically use for that kind of service monitoring?

    I think the key to Systems Administration / Engineering in general is to look at everything from the perspective of the service being offered to the user.


  6. Jason Dixon

    May 1, 2010 at 7:16 am

    We’re attempting to address this chasm in thought with Circonus. The emphasis should be on metric collection, not some interpretation of what the metrics mean (that is best left to the user). Enable the user to correlate any disparate metrics for trending and root cause analysis… it’s a very powerful method (and tool) that to date has been vastly underutilized.

    P.S. Sorry for the sales-pitch-esque answer, but we’re passionate about this stuff. :)


  7. Wesley.Nonapeptide

    May 1, 2010 at 5:13 pm

    @Jason – Good idea about focusing on the collection and less on the interpretation. Don’t worry about pitching a product even if it’s your own. I’m all for people talking about what they believe in even if they also get paid for it.

    I like to think of my blog as a free-for-all. If anyone wants to say anything they’re welcome to (except if it involves R0l3X watches), but just be sure to back it up and be ready to get some tough questions from other commenters (or myself). =)

