@georgieboy We switched to a new monitoring & alerting system recently, then had a power failure where the DNS servers didn't come back. Our monitoring is done entirely by host names, and 'I cannot resolve this hostname' definitely causes probe failures.
There was oh so much email (once we got DNS working again). I was kind of impressed by how fast our mail server could get through it all.
(Now we have a special 'there is a large scale problem, I am shutting everything else up' alert.)
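A minimal sketch of how a "shut everything else up" alert like that could work, not the actual setup described here. The function name `alerts_to_send` and the 50% threshold are made up for illustration; a real system would pick its own definition of "large scale".

```python
def alerts_to_send(failing, total, threshold=0.5):
    """Return the alert messages to send for one check cycle.

    failing:   names of the checks that failed this cycle
    total:     number of checks that were run
    threshold: hypothetical cutoff for "this is a large scale problem"
    """
    if total > 0 and len(failing) / total >= threshold:
        # Large scale problem: send one summary and suppress the rest.
        return [f"large scale problem: {len(failing)} of {total} checks failing"]
    # Normal case: one alert per failing check.
    return [f"{name} is failing" for name in sorted(failing)]
```

With something like this, a DNS outage that breaks every hostname-based probe produces one email instead of one per check.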
@cks Yeah, when I was in CS, the switch our Nagios server was connected to was off for a weekend. Also, I'd never throttled alerts, so they fired at every check: roughly every 5-10 minutes per service, times about 200 services at the time. That did take long enough that I had to rm mqueue/* as well; come to think of it, that might have been the last time I did.
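For what throttling could have looked like, here is a rough sketch. It is not how Nagios does it; the class `AlertThrottle` and the one-hour interval in the usage lines are assumptions for illustration only.

```python
class AlertThrottle:
    """Allow at most one alert per service every min_interval_s seconds."""

    def __init__(self, min_interval_s):
        self.min_interval_s = min_interval_s
        self._last_sent = {}  # service name -> time the last alert was sent

    def should_send(self, service, now_s):
        last = self._last_sent.get(service)
        if last is not None and now_s - last < self.min_interval_s:
            return False  # still inside the quiet period for this service
        self._last_sent[service] = now_s
        return True


throttle = AlertThrottle(min_interval_s=3600)  # hypothetical: one alert per hour
first = throttle.should_send("http", now_s=0)     # first failure goes out
repeat = throttle.should_send("http", now_s=300)  # repeat 5 minutes later is held back
```

Each service is tracked separately, so a new failure elsewhere still gets through while the repeats stay quiet.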