Alerts as Dashboard Substitution


A recurring argument in on-call discussions on Reddit and elsewhere holds that dashboards are redundant if a team’s alerts are good enough. The position has a kernel of truth. A team that surfaces the right customer-facing metrics through a sufficiently dense set of alerts can operate for a long time without looking at a dashboard between incidents. The position fails because it treats dashboards and alerts as competing answers to the same question. They are not. They answer different questions, and a team that conflates them ends up with engineers who get paged correctly and then land in the dark.

An alert is binary by design. It resolves one question, which is whether something is wrong enough to interrupt a person. A dashboard does something else. It shows what the system is doing across a window of time, and lets the engineer decide whether what they are seeing matches what they expect. Alert tuning, no matter how careful, will never produce that. It produces a more reliable threshold, which is a different artifact entirely.
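
To make the contrast concrete, here is a minimal Python sketch. The threshold, the window, and the names are all hypothetical, and a real alert lives as a rule in the monitoring stack rather than in application code; the point is only the shape of the two answers.

```python
# Minimal sketch: an alert collapses a window of data into a yes-or-no
# decision, while a dashboard keeps the window intact for a human to judge.
# Threshold and names are hypothetical.

from statistics import mean

ERROR_RATE_THRESHOLD = 0.05  # assumed paging threshold: 5% errors


def should_page(error_rates_last_5m: list[float]) -> bool:
    """An alert answers exactly one question: interrupt a person, or not."""
    return mean(error_rates_last_5m) > ERROR_RATE_THRESHOLD


def dashboard_panel(error_rates_last_24h: list[float]) -> list[float]:
    """A dashboard answers a different question: what has the system been
    doing over time? It returns the series itself and leaves the judgment
    to the engineer reading it."""
    return error_rates_last_24h
```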

The first job of a dashboard is to shorten the distance between the page and the diagnosis. When an alert fires, an engineer opens a dashboard before doing anything else. The alert supplies the interrupt. The dashboard supplies the context that turns the interrupt into a working hypothesis. A team with disciplined alerts and neglected dashboards produces an on-call experience where engineers know they need to act but spend the first ten minutes assembling the picture by hand. That time compounds across every incident the team handles.

The picture they assemble by hand is not free. An engineer without a dashboard reaches for logs or traces, which means querying a log store, scrolling through trace spans, or in the worst case opening a shell on a worker to read process state directly. Each of those is more expensive than reading a pre-computed metric. Logs and traces are high-cardinality records of individual events, retained at a cost that scales with traffic. Metrics aggregated into a dashboard are cheap to store and cheap to query, because the aggregation has already happened. A team that operates without dashboards is implicitly choosing to do that aggregation by hand, in real time, against the most expensive data store they have, while paying to retain that data long enough to be useful. A well-designed metrics layer lets a team reduce log and trace retention to the window where those tools actually earn their cost, which is investigating the recent past in detail. The dashboard carries the long view. Logs and traces carry the close view. Conflating the two pushes cost into the storage tier that is least suited to bear it.
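
A small sketch of that cost argument, with made-up record shapes: both functions arrive at the same hourly error rate, but one reads two pre-aggregated counters while the other has to touch every retained log line to get there.

```python
# Sketch only: the same hourly error rate derived two ways. The record
# shapes and field names are invented for illustration.

def error_rate_from_metrics(errors_per_hour: int, requests_per_hour: int) -> float:
    """Pre-aggregated counters: two numbers, rolled up at write time.
    Storage and query cost stay flat regardless of traffic."""
    return errors_per_hour / max(requests_per_hour, 1)


def error_rate_from_logs(log_lines: list[dict]) -> float:
    """The same answer reconstructed from raw events: every request is
    scanned, so the cost scales with traffic, and the lines must still be
    retained when the question is finally asked."""
    total = len(log_lines)
    errors = sum(1 for line in log_lines if line.get("status", 200) >= 500)
    return errors / max(total, 1)
```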

The second job of a dashboard is to surface conditions that never reach an alert threshold. A well-tuned alert fires when something is unambiguously wrong. Many real failure modes develop gradually. Latency creeps. Error rates rise by tenths of a percent per week. Queue depth climbs over a quarter until a Tuesday afternoon spike pushes it over. None of these cross an alert threshold on the day they begin, and a system that relies only on alerts will not see them until they do. A dashboard that someone reads habitually, the way a pilot reads instruments, catches the trend in the weeks before the threshold. This is work that alerts cannot do, because alerts exist to suppress information until a specific condition is met.
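
Here is a sketch of the kind of trend an alert is built to ignore, with illustrative numbers: the weekly error rate drifts upward well below the paging threshold, and a straight-line fit over the dashboard’s window says roughly when it will cross, weeks before anything fires.

```python
# Sketch with illustrative numbers: a slow drift that never pages, and a
# least-squares slope that projects when it will cross the threshold.

THRESHOLD = 0.05  # assumed paging threshold

# error rate sampled once a week for eight weeks: rising, never alerting
weekly_error_rate = [0.010, 0.012, 0.013, 0.015, 0.016, 0.018, 0.019, 0.021]

n = len(weekly_error_rate)
xs = range(n)
x_mean = sum(xs) / n
y_mean = sum(weekly_error_rate) / n
slope = (
    sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, weekly_error_rate))
    / sum((x - x_mean) ** 2 for x in xs)
)

weeks_until_threshold = (THRESHOLD - weekly_error_rate[-1]) / slope
print(f"crosses the paging threshold in roughly {weeks_until_threshold:.0f} weeks")
```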

The third job of a dashboard is to encode the team’s mental model of the service. Choosing what to display, in what arrangement, at what time scale, is an act of describing how the team believes the system behaves. A dashboard with stale panels, metrics nobody reads, or queries that no longer return data reflects a team whose shared model has decayed. The dashboard becomes a record of what the service used to be rather than a window onto what it is. This is harder to detect than a noisy alert because nothing fires when a dashboard goes stale. It simply stops being consulted, and the team adapts to operating without it.

Good dashboards therefore require their own form of gardening. The work is not identical to alert maintenance but the discipline is the same. Panels need to be removed when the metric they show no longer corresponds to a question the team asks. New panels need to be added when the team starts caring about a new dimension of the service. Time ranges need to be revisited as the rhythm of the service changes. The cost of skipping this maintenance is not a false page. It is a slow drift toward a dashboard that engineers open out of habit and close without learning anything.

Consider a worked example. An alert fires for elevated error rates on a checkout service. The engineer on call opens the service dashboard and sees that errors are concentrated on requests routed through one of three upstream payment providers, that latency to that provider has roughly doubled over the past forty minutes, and that the error rate on the other two providers is flat. The diagnosis at that point is essentially complete. Without that dashboard, the same engineer would be querying logs for error patterns, comparing timestamps, and trying to reconstruct the upstream distribution from individual log lines. The alert would have fired in either case. The difference between a five-minute resolution and a forty-minute resolution is whether the dashboard already existed when the page came in.
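
The panel behind that diagnosis does not need to be sophisticated. A sketch, with hypothetical provider names and made-up counts: because the metric is already broken down by provider at write time, the comparison the engineer needs is a single read rather than a reconstruction from log lines.

```python
# Sketch: per-provider error rates from pre-aggregated counters.
# Provider names and counts are invented for illustration.

requests = {"provider_a": 48_000, "provider_b": 51_000, "provider_c": 47_500}
errors = {"provider_a": 110, "provider_b": 95, "provider_c": 6_200}

error_rate_by_provider = {
    provider: errors[provider] / requests[provider] for provider in requests
}

# provider_c stands out immediately; the other two are flat
for provider, rate in sorted(error_rate_by_provider.items(), key=lambda kv: -kv[1]):
    print(f"{provider}: {rate:.2%}")
```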

A defender of the alerts-only position might respond that the fix is one alert per provider, so that the comparison is implicit in which alerts have fired. That works at three providers and breaks at thirty. It also pushes the comparison into the alert layer, which is the layer least suited to carry it, and it adds thresholds to maintain where one dashboard panel would do.

Capacity planning is the other side of the same coin and rarely involves a page at all. The signal that matters is seldom the current state. It is the trajectory of the system over months, read against the trajectory of the business. A team that has been recording disk utilization, compute pressure, and queue depth as metrics for the last year can make statements of the form: our disk utilization has grown linearly at five percent per month, our average compute utilization has moved from seventy-five to eighty percent over the last six months, and we have onboarded a thousand customers at an average size of x during that period. Those three statements together produce a defensible forecast for the next region launch or the next sales push. Without the metrics layer, that data may simply not exist. Logs and traces age out, often within weeks, because retaining them at the volumes a busy service produces is expensive. Pulling a one-off chart from log data assumes the log data is still around to pull from, and at six- or twelve-month horizons it usually is not. The team is then forced to forecast from what people remember about how the system behaved, which is a worse input than a chart and a more confident one than it deserves to be.
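
As a sketch of the arithmetic those statements support, using the paragraph’s illustrative numbers plus two assumptions the text does not supply, a current disk fill level to project forward from and a linear continuation of the compute trend:

```python
# Sketch of the forecast. Growth rates come from the text; the current disk
# fill level and the linear extrapolation are assumptions for illustration.

disk_utilization_now = 0.55    # assumed, not from the text
disk_growth_per_month = 0.05   # five percentage points per month, per the text

compute_utilization_6mo_ago = 0.75
compute_utilization_now = 0.80
compute_growth_per_month = (compute_utilization_now - compute_utilization_6mo_ago) / 6

months_until_disk_full = (1.0 - disk_utilization_now) / disk_growth_per_month
compute_in_12_months = compute_utilization_now + 12 * compute_growth_per_month

print(f"disk full in roughly {months_until_disk_full:.0f} months at the current rate")
print(f"projected compute utilization in 12 months: {compute_in_12_months:.0%}")
```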

A team with strong alerts and weak dashboards knows when to act and spends the first ten minutes of every incident reconstructing context. The inverse team, with rich dashboards and thin alerting, often hears about its outages from customers first. Neither failure mode is rare, and both are predictable consequences of treating one practice as a substitute for the other. Alerts and dashboards belong to the same garden. They do not grow the same plants.