Observability That Actually Pays Off
Dashboards are cheap to make and easy to ignore. Here is how I decide what is worth measuring.
It is easy to drown in metrics. Every tool ships with a default dashboard, and before long you have forty panels nobody looks at. The hard part is not collecting data — it is knowing which signals are worth waking up for.
Start from the question, not the metric
Before adding a chart, I ask what decision it will inform. If a graph cannot change what I do next, it is decoration. The metrics that earn their place are the ones that answer a real question: Is the user experience degrading? Are we about to run out of capacity? Did that deploy make things worse?
The signals I keep close
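- Latency at the tail. Averages lie, because a handful of very slow requests barely moves the mean. The p99, the latency that the slowest 1% of requests exceed, is where users feel pain.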
- Error rate by cause. A flat error count hides the story; grouping by reason tells you where to look first.
- Saturation. Queues, connection pools, disk — the places that quietly fill up until everything stops at once.
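To make the three signals concrete, here is a minimal Python sketch. The sample latencies, error reasons, and pool numbers are invented for illustration, and `percentile` is a hypothetical helper using the nearest-rank method, not the API of any particular monitoring tool.

```python
from collections import Counter


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, which avoids floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds: 98 fast requests and 2 very slow ones.
latencies_ms = [20] * 98 + [2000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)  # pulled up only slightly
p50_ms = percentile(latencies_ms, 50)            # the typical request
p99_ms = percentile(latencies_ms, 99)            # the tail users complain about

# Hypothetical error reasons: a flat count says "5 errors",
# while grouping shows which cause dominates.
error_reasons = ["timeout", "timeout", "bad_request", "timeout", "upstream_5xx"]
errors_by_cause = Counter(error_reasons)
top_cause, top_count = errors_by_cause.most_common(1)[0]

# Saturation as a simple ratio: connections in use over pool capacity (assumed values).
pool_in_use, pool_size = 45, 50
pool_utilization = pool_in_use / pool_size

print(f"mean={mean_ms:.1f}ms p50={p50_ms}ms p99={p99_ms}ms")
print(f"top error cause: {top_cause} ({top_count} of {len(error_reasons)})")
print(f"pool utilization: {pool_utilization:.0%}")
```

With this made-up data the mean stays under 60 ms while the p99 is 2000 ms, which is the gap an average hides. In practice these numbers would come from whatever metrics backend you already run; the sketch only shows the arithmetic behind each signal.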
Alerts are a promise
Every alert is a promise that someone will act when it fires. If an alert routinely gets silenced, it is not an alert — it is noise that erodes trust in the whole system. I would rather have five alerts I respect than fifty I ignore.
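One way to check whether that promise is being kept is to audit how often each alert is silenced rather than acted on. The sketch below is a rough illustration under assumptions: the alert names, the `history` records, the `noisy_alerts` helper, and the 50% threshold are all hypothetical, and a real audit would read from your paging tool's own export.

```python
from collections import defaultdict

# Hypothetical alert history: (alert name, what the on-call engineer did).
history = [
    ("disk_full", "acted"),
    ("disk_full", "acted"),
    ("cpu_high", "silenced"),
    ("cpu_high", "silenced"),
    ("cpu_high", "acted"),
    ("cpu_high", "silenced"),
]


def noisy_alerts(history, max_silenced_ratio=0.5):
    """Return names of alerts silenced more often than max_silenced_ratio."""
    fired = defaultdict(int)
    silenced = defaultdict(int)
    for name, outcome in history:
        fired[name] += 1
        if outcome == "silenced":
            silenced[name] += 1
    return sorted(
        name for name in fired
        if silenced[name] / fired[name] > max_silenced_ratio
    )


# Alerts listed here are candidates to fix, retune, or delete.
candidates = noisy_alerts(history)
print(candidates)
```

The threshold is a judgment call, not a rule. The point is only that an alert silenced most of the time is a candidate for repair or removal.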
Good observability is not about seeing everything. It is about seeing the few things that matter, clearly, before they become incidents.