In my last post, I described how my webhost suspended a hosting account, which took down not only the website but also DNS resolution for the domain. Without proper DNS resolution, the systems that relied on it failed miserably. Many things failed; however, there was one warning that I never received: a “Website Down!” alert.
I should have received a “Website Down!” alert within minutes of the account being suspended. Why didn’t that happen? I thought about it for roughly 2 seconds before I figured it out (that was 1.5 seconds too long… but it was early in the morning so that might explain it). My website monitor is Site 24×7 from Zoho Corp. It is set to ping my server and report on latency as well as “uptime”. That uptime is based on ping response. However, ICMP sits much lower in the stack than the website itself, so a ping reply only shows that some host at the resolved address is answering, not that my site is being served.
Before I go any further, I’d like to say that I have long known and understood the importance of checking your assumptions, and specifically your assumptions about what a ping response is really telling you. However, sometimes an assumption still slips through in spite of everything you know.
When my DNS records were suspended as part of the account suspension, my site’s name began resolving to a different IP address. I’m not sure why, but I believe that my webhost’s DNS system points a suspended account’s records at one of their general-purpose servers by default. As such, my ping test resolved the name, got replies from that other server, and I never got a “Website Down!” alert.
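A check along these lines might have caught the change. This is only a rough Python sketch, not something my monitoring service offers as far as I know: the function names and the address 203.0.113.10 are made up for illustration, and it assumes the server has a fixed, known IPv4 address.

```python
import socket

# Hypothetical: the address(es) my server is supposed to have.
EXPECTED_IPS = {"203.0.113.10"}


def resolve_ipv4(hostname):
    """Return the set of IPv4 addresses the hostname currently resolves to."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is (address, port).
    return {info[4][0] for info in infos}


def dns_points_at_my_server(hostname, expected_ips, resolver=resolve_ipv4):
    """True only if the name resolves, and only to addresses we expect."""
    try:
        resolved = resolver(hostname)
    except socket.gaierror:
        # The name did not resolve at all.
        return False
    # An empty answer, or any unexpected address, counts as a failure.
    return bool(resolved) and resolved <= set(expected_ips)
```

Calling `dns_points_at_my_server("example.com", EXPECTED_IPS)` from a scheduled job and alerting on `False` would flag a silent repoint like the one described above, although it would also need updating whenever the server legitimately moves.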
Furthermore, merely checking for HTTP responses likely wouldn’t have helped, since Apache would have been running normally. I specifically needed to check for the availability of a unique web page on my site, and not index.htm (I’m sure you can see why that wouldn’t be quite so helpful).
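To make that concrete, here is a minimal Python sketch of such a content check. It assumes a hypothetical page on my site containing a marker string that no other server would ever return; the URL, the marker, and the function names are all placeholders I made up.

```python
import urllib.error
import urllib.request


def fetch(url, timeout=10):
    """Return (status_code, body_text) for a GET, or (None, "") on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, timeouts and HTTP errors.
        return None, ""


def site_is_up(url, marker, fetcher=fetch):
    """True only if the page loads with status 200 AND contains the marker."""
    status, body = fetcher(url)
    return status == 200 and marker in body
```

For example, `site_is_up("https://example.com/monitor-check.html", "my-site-canary-7f3a")` would return `False` if a default page from some other Apache instance answered with a perfectly healthy 200, which is exactly the case a plain HTTP status check would miss.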
All this to say: When setting up monitoring and alerting, make sure you question what your monitors are actually monitoring and what your alerts are actually saying. Do you have any similar stories about poor assumptions? Any alerting system failures you’d like to share? Don’t be shy… we all make mistakes. =)