In a previous post, byrona brought up a valid point:
For our internal stuff we also apply the same process, use a canned template and modify as necessary. Internally we also use the application monitoring functionality for some interesting things such as monitoring our facilities for HVAC status, UPS status, data-center temperature, data-center humidity and eventually per cabinet power usage as well.
To correlate any given alarm to a root cause, monitoring across the entire stack is a must-have. I always use the example of a physical 19” rack stacked with VMware ESXi hosts. One of those hosts experiences a failure on one of its PSUs and, because of that failure, is automatically placed in maintenance mode. During the evacuation of the host (using vMotion), an alarm goes off for one of the VMs being migrated, because it disappeared from the network for a second or so.
The root cause is obviously the failing PSU, but without a monitoring solution that has picked up on that, the root cause is much harder to locate. And even when you are monitoring the correct stuff, the root cause might still be external to the objects being monitored. It could very well be that the temperature in that part of the rack is much higher than anticipated, because the amount of power and Ethernet cabling there prevents heat from being dissipated correctly, which in turn led the PSU to fail.
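To make the correlation idea concrete, here is a minimal sketch of how a monitoring tool might walk a dependency graph to tie the VM's network alarm back to the host's PSU alarm. All object names, timestamps, and the `DEPENDS_ON` mapping are invented for illustration; real tools build this graph from discovered topology rather than a hard-coded dict.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Alarm:
    obj: str        # monitored object that raised the alarm
    message: str
    time: datetime

# Child -> parent dependencies: a VM depends on its host,
# the host depends on its rack PDU, and so on.
DEPENDS_ON = {
    "vm-042": "esxi-host-03",
    "esxi-host-03": "rack-07-pdu",
}

def root_cause(alarms, window=timedelta(minutes=5)):
    """For each alarm, walk up the dependency chain and prefer an
    earlier alarm on an upstream object within `window` as the
    root-cause candidate."""
    by_obj = {a.obj: a for a in alarms}
    causes = {}
    for alarm in alarms:
        candidate = alarm
        obj = alarm.obj
        while obj in DEPENDS_ON:
            obj = DEPENDS_ON[obj]
            upstream = by_obj.get(obj)
            if upstream and candidate.time - window <= upstream.time <= candidate.time:
                candidate = upstream
        causes[alarm.obj] = candidate.obj
    return causes

alarms = [
    Alarm("esxi-host-03", "PSU failure", datetime(2013, 5, 1, 10, 0)),
    Alarm("vm-042", "network unreachable", datetime(2013, 5, 1, 10, 2)),
]
print(root_cause(alarms))
# {'esxi-host-03': 'esxi-host-03', 'vm-042': 'esxi-host-03'}
```

The VM's alarm is attributed to the host because the host alarmed first on an object the VM depends on; the PSU-failure alarm has no earlier upstream alarm, so it remains its own root cause. Note that this simple sketch would miss the overheating scenario above: if the rack temperature sensor isn't in the dependency graph (or isn't monitored at all), the chain stops at the PSU.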
I think everybody agrees that monitoring your hardware stack and the meta-information about the physical layer (like temperature, humidity and the other metrics byrona mentioned) is critical; but how do you actually accomplish this?
Do you spec your DC to be ‘monitoring-capable’? If you're renting cage space from a co-location provider, how do you cooperate with them so you can monitor ‘their’ objects and metrics?