Alerts Are a Garden
Engineers tend to treat alert configuration as a one-time task. They define thresholds during a service launch, wire the alerts into the on-call rotation, and move on. This approach fails because alerts behave like a garden rather than a thermostat. They require ongoing attention throughout the life of a service, and the team that plants them once and walks away will return to find either a thicket of noise or a bed of weeds where real problems hide.
In a well-resourced environment, alerts derive from customer signals such as throughput, latency, and rate-limit consumption rather than from hardware proxies like disk space or CPU utilization. Product owners define the thresholds at which latency or error rates constitute a customer-visible problem. Performance testing establishes the IO wait percentage or kernel object counts a service can tolerate before degradation. These conditions describe a mature alerting practice, and most teams do not begin there. Engineers inherit legacy services, absorb new traffic patterns from upstream callers, and ship features faster than benchmarks can catch up. The thresholds set on day one rarely match the system six months later, and the gap between intent and reality widens with every deploy.
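To make the distinction concrete, here is a minimal sketch in Python of what alert rules keyed to customer signals might look like. The signal names, thresholds, and the AlertRule structure are invented for illustration rather than drawn from any particular monitoring system; what matters is that each rule names a customer-visible signal and records the rationale the product owner agreed to.

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    """A hypothetical alert definition keyed to a customer-visible signal."""
    signal: str       # metric name, as close to the customer experience as possible
    threshold: float  # value agreed with the product owner
    comparison: str   # "above" or "below"
    rationale: str    # why crossing this level constitutes customer-visible harm

# Illustrative rules built on customer signals rather than hardware proxies.
RULES = [
    AlertRule("checkout_error_rate", 0.01, "above",
              "more than 1% of checkouts failing is customer-visible"),
    AlertRule("p99_request_latency_ms", 1200.0, "above",
              "the product owner set 1.2s as the limit for an acceptable page load"),
    AlertRule("rate_limit_headroom_pct", 10.0, "below",
              "under 10% headroom, the next traffic spike throttles real users"),
]

def breached(rule: AlertRule, value: float) -> bool:
    """Return True when an observed value crosses the agreed threshold."""
    return value > rule.threshold if rule.comparison == "above" else value < rule.threshold

if __name__ == "__main__":
    observed = {
        "checkout_error_rate": 0.004,
        "p99_request_latency_ms": 1450.0,
        "rate_limit_headroom_pct": 35.0,
    }
    for rule in RULES:
        if breached(rule, observed[rule.signal]):
            print(f"ALERT {rule.signal}: {rule.rationale}")
```

Rules like these are only as good as the numbers inside them, and those numbers do not stay current on their own.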
This drift is the core problem. An alert calibrated for last quarter’s traffic will either fire constantly or stay silent through real incidents. A threshold chosen without performance data reflects a guess, and guesses age poorly. New user flows shift the baseline. Dependencies change their failure modes. The signals that mattered at launch may no longer correlate with customer pain, and the only remedy is the steady work of revisiting each alert against current conditions, removing the ones that no longer indicate harm, and reordering the rest by the severity of impact they actually represent.
The principle that should guide this work is intolerance of false positives. A page at three in the morning for a non-issue costs more than the lost sleep. It erodes the engineer’s productivity for the following day, damages morale across the rotation, and, most consequentially, trains the responder to discount the next page. An engineer who has been burned by a false alarm at three is slower to react to a real incident at six, more likely to silence the notification, and in the worst case will sleep through the page entirely. False positives therefore deserve the same treatment as outages, which is prompt investigation, labeling, and adjustment.
The discipline this requires cuts both ways. Widening a threshold is the easiest response to a noisy alert and the one most likely to produce a silent miss. A threshold tuned too loosely means the first signal of a real incident arrives as a customer complaint, which is the outcome alerting exists to prevent. The correct response to a false positive is rarely to relax the threshold by a fixed margin; it is to ask why the alert fired without a real problem behind it. The answer might point to a missing condition, a noisy upstream dependency, a metric that needs smoothing, or an alert that should be replaced by something closer to the customer experience. Each of these is a more durable fix than a wider tolerance, and each leaves the rotation better calibrated than it was.
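As one illustration of the smoothing case, the sketch below compares a raw threshold check against an exponentially weighted moving average of the same series. The series, weight, and threshold are invented for the example, and most monitoring systems offer smoothing natively, but the shape of the fix is the same: the spike that caused the false positive no longer crosses the line, while a sustained rise still does.

```python
def ewma(values, alpha=0.3):
    """Exponentially weighted moving average: recent points count more,
    but a single outlier no longer dominates the comparison."""
    smoothed = []
    current = values[0]
    for v in values:
        current = alpha * v + (1 - alpha) * current
        smoothed.append(current)
    return smoothed

# An invented latency series (ms): one transient spike, then a sustained rise.
latency = [210, 220, 215, 900, 230, 225, 610, 640, 660, 700]
THRESHOLD = 500

raw_breaches = [v > THRESHOLD for v in latency]
smooth_breaches = [v > THRESHOLD for v in ewma(latency)]

print(raw_breaches)     # the lone 900ms spike breaches the raw threshold
print(smooth_breaches)  # the smoothed series ignores the spike but catches the sustained rise
```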
Tuning a specific alert involves two dimensions that teams often conflate. The first is the threshold itself, the value at which the metric crosses from acceptable to actionable. The second is the duration the metric must hold that value before the alert fires. A latency alarm that pages on a single elevated data point will catch transient spikes that resolve before the engineer opens a laptop. The same alarm configured to fire only after five minutes of sustained elevation will ignore the spike but catch the genuine degradation. Adjusting duration is often the higher-leverage fix because it filters noise without sacrificing sensitivity to the underlying condition. Threshold adjustment, by contrast, changes what the alert considers a problem at all, and should follow from evidence about the service’s real behavior rather than from a desire to quiet the rotation. A team that tunes both dimensions deliberately will produce alerts that fire when something is genuinely wrong and stay silent when the system is merely noisy.
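A minimal sketch makes the two dimensions concrete. The evaluator below assumes one latency sample per minute, and the numbers are invented; production alerting systems express the same idea as a sustained-duration or "for" clause on the alert definition, but the logic is equivalent.

```python
def ever_fires(samples, threshold, sustained_minutes):
    """Walk the series minute by minute and report whether the alert would
    have fired: it fires once the metric has stayed above the threshold for
    `sustained_minutes` consecutive samples. The threshold defines what counts
    as a problem; the duration filters transient spikes."""
    consecutive = 0
    for s in samples:
        consecutive = consecutive + 1 if s > threshold else 0
        if consecutive >= sustained_minutes:
            return True
    return False

# Invented per-minute p99 latency samples (ms).
transient_spike  = [300, 310, 1900, 320, 305, 315]
real_degradation = [300, 900, 950, 1000, 1100, 1200]

print(ever_fires(transient_spike, 800, 1))   # True:  a single-point alarm pages on the blip
print(ever_fires(transient_spike, 800, 5))   # False: a five-minute duration lets the blip pass
print(ever_fires(real_degradation, 800, 5))  # True:  sustained elevation still pages
```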
There is a further category of fix that teams often overlook, which is to change the application rather than the alert. When a service produces conditions that the team has decided to treat as false positives in perpetuity, the alert is reporting something real about the application’s behavior, even if the team has chosen not to act on it. A nightly batch job that drives queue depth past the alarm threshold every night, a retry loop that briefly saturates a connection pool on every deploy, or a memory pattern that triggers a warning every time garbage collection runs are not alerting problems. They are application problems that the team has agreed to ignore. The honest response is to fix the underlying behavior, because an application that requires its operators to maintain a standing list of conditions to disregard is one in which real degradation will eventually hide among the conditions everyone has learned to wave away.
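For the retry-loop example in particular, fixing the underlying behavior might look like the sketch below, which replaces a tight retry loop with exponential backoff and jitter so that a deploy spreads its reconnect attempts over time instead of briefly saturating the pool. The function names, limits, and the flaky dependency are invented for illustration.

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.2, cap=5.0):
    """Retry a failing call with exponential backoff and full jitter, so that
    many instances restarting at once do not hammer the pool in lockstep."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            delay = random.uniform(0, min(cap, base_delay * (2 ** attempt)))
            time.sleep(delay)

if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_connect():
        # Stand-in for a dependency that refuses the first few connections after a deploy.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ConnectionError("pool exhausted")
        return "connected"

    print(call_with_backoff(flaky_connect))
```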
Alerting is therefore not a configuration task but a continuous conversation between a team and its service. The thresholds, durations, and signals that constitute a healthy rotation reflect what the team currently knows about how the service fails and how those failures reach customers. That knowledge changes, and the alerts must change with it. Teams that schedule regular alert review, treat false positives as defects to investigate rather than nuisances to suppress, and remain willing to fix the application when the application is what is broken will build rotations that protect both their customers and the engineers who carry the pager. Teams that do not will discover, usually at three in the morning, that the garden has gone to seed.
