When you have a new alerting and monitoring system, 'who watches the watchmen' becomes an interesting and relevant question, especially when the watchmen have a lot of separate components and moving parts.
I ended up writing a blog entry about how we're currently monitoring our Prometheus setup to see if anything in it has gone wrong. It's probably pretty standard and boring for basic Prometheus setups. The ultimate answer at the bottom of everything is 'a cron job checks that Prometheus and Alertmanager are up'.
https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusSelfMonitoring
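For a rough idea of what the 'cron job checks that Prometheus and Alertmanager are up' part could look like, here is a minimal sketch. It is not the script from the blog entry. It assumes both daemons run on the local machine on their default ports (9090 and 9093) and probes their `/-/healthy` endpoints; the `ENDPOINTS` table and the function names are made up for illustration.

```python
"""Sketch of a cron-driven liveness check for Prometheus and Alertmanager."""
import urllib.error
import urllib.request

# Assumption: both daemons listen on localhost with their default ports.
ENDPOINTS = {
    "prometheus": "http://localhost:9090/-/healthy",
    "alertmanager": "http://localhost:9093/-/healthy",
}


def is_up(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx HTTP error.
        return False


def check_all(endpoints, probe=is_up):
    """Return the sorted names of the endpoints that failed the probe."""
    return sorted(name for name, url in endpoints.items() if not probe(url))


def main():
    """Print the components that are down; return a shell-style exit code."""
    down = check_all(ENDPOINTS)
    if down:
        # Any output from a cron job is normally mailed to the job's owner,
        # provided mail delivery is set up on the machine.
        print("DOWN: " + ", ".join(down))
        return 1
    return 0
```

Saved as a module (say `promcheck.py`, a hypothetical name), this could be run from a crontab entry every few minutes with something like `python3 -c 'import promcheck, sys; sys.exit(promcheck.main())'`. The point of keeping it this small is that it depends on nothing in the Prometheus stack itself, so it still works when the thing it watches is down.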