Monitoring is about tracking system parameters so that the health of the system is known.
Alerting
Alerting is a feature made possible by a monitoring system. When a parameter’s value crosses a threshold, one of a few things might happen, in increasing order of severity:
- The event is logged. It’s non-severe, but could be important for troubleshooting.
- A ticket is generated. The issue needs to be looked at, but isn’t urgent enough to investigate immediately.
- As a last resort, a notification is pushed to an on-call engineer for timely manual intervention.
In alerting systems, optimise for a high signal-to-noise ratio, i.e. only page on-call engineers when there is very little doubt that something needs to be resolved both urgently and with manual intervention.
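The escalation levels above can be sketched as a small routing function. This is only an illustration: the names `Rule`, `Action`, `classify` and `sustained`, and the idea of requiring several consecutive breaches before paging, are assumptions made for the example and not part of any particular alerting tool.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Action(Enum):
    NONE = "none"
    LOG = "log"
    TICKET = "ticket"
    PAGE = "page"


@dataclass(frozen=True)
class Rule:
    # Hypothetical thresholds; expected order: log_above <= ticket_above <= page_above.
    log_above: float
    ticket_above: float
    page_above: float
    # Consecutive breaching samples required before paging (must be >= 1).
    page_after_samples: int = 3


def sustained(values: Sequence[float], threshold: float, count: int) -> bool:
    """True if the most recent `count` samples all exceed `threshold`."""
    recent = values[-count:]
    return len(recent) == count and all(v > threshold for v in recent)


def classify(rule: Rule, values: Sequence[float]) -> Action:
    """Pick the most severe action justified by the samples (oldest first)."""
    if not values:
        return Action.NONE
    latest = values[-1]
    # Page only on a sustained breach, to keep the signal-to-noise ratio high.
    if sustained(values, rule.page_above, rule.page_after_samples):
        return Action.PAGE
    if latest > rule.ticket_above:
        return Action.TICKET
    if latest > rule.log_above:
        return Action.LOG
    return Action.NONE
```

With `Rule(0.01, 0.05, 0.20)` applied to an error rate, a single spike to 0.5 would only raise a ticket, while three consecutive samples above 0.20 would page. Whether a sustained-breach rule is the right noise filter depends on the system being monitored.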
White Box vs. Black Box Monitoring
White-box monitoring is about monitoring metrics internal to the system. Black-box monitoring is about probing the service from the outside, as closely as possible to how a client would use it.
- White-box is typically used more than black-box.
- Black-box tends to be symptom-oriented, meaning it’s there to tell you something is wrong, but isn’t as valuable during troubleshooting.
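A black-box probe might look something like the sketch below: it calls the service over HTTP as a client would and records only what a client could observe (status and latency). The `probe` function, the `ProbeResult` type and the injectable `opener` parameter are hypothetical names chosen for this example; only `urllib.request.urlopen` is a real standard-library call.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ProbeResult:
    ok: bool
    status: Optional[int]  # None when no HTTP response was received at all
    latency_s: float


def probe(
    url: str,
    timeout_s: float = 5.0,
    opener: Callable = urllib.request.urlopen,  # injectable so it can be faked in tests
) -> ProbeResult:
    """Issue one request and report only externally visible symptoms."""
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout_s) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx or 5xx).
        status = exc.code
    except OSError:
        # Covers URLError, connection failures and timeouts: no response.
        status = None
    latency_s = time.monotonic() - start
    ok = status is not None and 200 <= status < 400
    return ProbeResult(ok=ok, status=status, latency_s=latency_s)
```

A failing probe says that clients are affected, but not why. Finding the cause would usually fall to white-box metrics exposed by the service itself.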
Metrics
The main metrics to monitor are:
- Latency — time taken to respond to a request.
- Traffic load — queries per second.
- Error rate — incorrect responses and server error responses, such as HTTP 500 for web servers.
- Saturation — memory usage/utilisation, CPU usage/utilisation, network bandwidth, storage usage, etc.
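As a rough illustration, the four metrics could be derived from a window of request records plus one resource reading. The `Request`, `Signals`, `percentile` and `summarise` names are made up for this sketch, and it counts only HTTP 5xx responses as errors. Detecting incorrect but successful responses would need application-specific checks.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Request:
    latency_s: float
    status: int


@dataclass(frozen=True)
class Signals:
    p50_latency_s: Optional[float]  # None when the window holds no requests
    p99_latency_s: Optional[float]
    queries_per_s: float
    error_rate: float
    saturation: float


def percentile(sorted_values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty, ascending-sorted sequence."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarise(
    requests: Sequence[Request],
    window_s: float,   # length of the observation window in seconds
    used: float,       # e.g. bytes of memory currently in use
    capacity: float,   # e.g. total bytes of memory available
) -> Signals:
    latencies = sorted(r.latency_s for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)
    return Signals(
        p50_latency_s=percentile(latencies, 50) if latencies else None,
        p99_latency_s=percentile(latencies, 99) if latencies else None,
        queries_per_s=len(requests) / window_s,
        error_rate=errors / len(requests) if requests else 0.0,
        saturation=used / capacity,
    )
```

Latency is reported here as percentiles and not as a mean, since a mean can hide a slow tail. That is a design choice of this sketch.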
TODO: observability vs. APM vs. monitoring