Incident Management

When a monitor breaks at 3am, five things have to happen — fast and in the right order:

  1. An incident is opened that records what’s broken and when.
  2. Someone is identified as the on-call responder.
  3. They’re paged through their preferred contact methods.
  4. If they don’t ack, the page escalates to the next person.
  5. No one gets paged for known maintenance.

SiteQwality’s incident management module wires these five together. You can use any of them in isolation — incidents without on-call, manual incidents with no monitors, notification channels with no on-call schedule — but the value compounds when they work together.
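As a rough illustration of steps 3 to 5, the sketch below works out who would be paged, and when, for a single incident. It is not SiteQwality's implementation: `MaintenanceWindow`, `plan_pages`, and the fixed `ack_timeout` per escalation step are hypothetical names and simplifications chosen for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class MaintenanceWindow:
    """Hypothetical model of a known maintenance period."""
    start: datetime
    end: datetime

    def covers(self, at: datetime) -> bool:
        return self.start <= at < self.end


def plan_pages(opened_at, chain, ack_delays, ack_timeout, windows=()):
    """Return the (responder, paged_at) pairs for one incident.

    chain:      responders in escalation order.
    ack_delays: how long each responder takes to ack once paged;
                a responder missing from the dict never acks.
    """
    # Step 5: nobody is paged during known maintenance.
    if any(w.covers(opened_at) for w in windows):
        return []

    pages = []
    paged_at = opened_at
    for responder in chain:
        pages.append((responder, paged_at))          # step 3: page
        delay = ack_delays.get(responder)
        if delay is not None and delay <= ack_timeout:
            break                                    # acked in time, stop
        paged_at += ack_timeout                      # step 4: escalate
    return pages


opened = datetime(2024, 1, 1, 3, 0)
pages = plan_pages(
    opened,
    chain=["primary", "secondary", "manager"],
    ack_delays={"secondary": timedelta(minutes=2)},  # primary never acks
    ack_timeout=timedelta(minutes=5),
)
for responder, at in pages:
    print(responder, at.strftime("%H:%M"))
# primary 03:00
# secondary 03:05
```

The manager is never paged here because the secondary acknowledges within the timeout.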

┌──────────────────────┐
│ Monitor flips status │
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ Incident opened      │ ←── manual incidents go here too
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ Notification group   │ ── channels: email, Slack, webhooks, etc.
│ (account-level)      │
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ Escalation policy    │ ── targets: users + on-call schedules
│ (per incident)       │
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ User contact methods │ ── personal: email, SMS, phone-call, Slack DM
│ × notification rules │ ── personal: "high urgency? SMS at 0min, call at 5min"
└──────────────────────┘

The split between notification groups (account-level “where this monitor’s alerts go”) and contact methods + rules (per-user “how I personally want to be reached”) is intentional. It lets the team configure routing once and lets each engineer set their own delivery preferences.
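To make the per-user half of that split concrete, here is a minimal sketch of how personal notification rules could resolve into a delivery plan. The `NotificationRule` fields and the `delivery_plan` function are hypothetical and only mirror the "high urgency? SMS at 0min, call at 5min" example from the diagram; the real data model may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NotificationRule:
    """Hypothetical per-user rule: reach me via `method` after `delay_minutes`."""
    urgency: str        # e.g. "high" or "low"
    method: str         # e.g. "sms", "phone_call", "email", "slack_dm"
    delay_minutes: int  # minutes after the page starts


def delivery_plan(rules, urgency):
    """Return (delay_minutes, method) pairs for one urgency, earliest first."""
    matching = [r for r in rules if r.urgency == urgency]
    matching.sort(key=lambda r: r.delay_minutes)
    return [(r.delay_minutes, r.method) for r in matching]


# One engineer's personal preferences, independent of team-level routing.
my_rules = [
    NotificationRule("high", "phone_call", 5),
    NotificationRule("high", "sms", 0),
    NotificationRule("low", "email", 0),
]

print(delivery_plan(my_rules, "high"))
# [(0, 'sms'), (5, 'phone_call')]
```

Because the rules belong to the user, changing them does not touch the notification group or the escalation policy that selected that user.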

  • Don’t page for transient flakes. Set the notification group’s delay_send_after_minutes so a check that fails once and recovers within a minute doesn’t page anyone. See Notification groups.
  • Don’t escalate everything. Reserve escalation policies for high-urgency incidents. For warnings, a single Slack channel notification is enough.
  • Don’t put your CTO on every escalation policy. They’ll mute everything within a week. Build escalation chains with clear ownership: primary on-call → secondary → engineering manager.
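
The effect of delay_send_after_minutes in the first tip can be sketched as a simple gate on sending. Only the setting name comes from this page; the `should_send` function and the exact semantics (hold the notification until the delay has elapsed, and drop it if the check recovered in the meantime) are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def should_send(failed_at, recovered_at, delay_send_after_minutes, now):
    """Assumed semantics: send only once the delay has passed without recovery.

    recovered_at is None while the check is still failing.
    """
    deadline = failed_at + timedelta(minutes=delay_send_after_minutes)
    if now < deadline:
        return False  # still inside the grace period
    # Send only if the check had not recovered before the deadline.
    return recovered_at is None or recovered_at >= deadline


failed = datetime(2024, 1, 1, 3, 0, 0)

# A flake that recovers after 40 seconds never pages anyone.
print(should_send(failed, failed + timedelta(seconds=40), 1,
                  now=failed + timedelta(minutes=1)))   # False

# A check that is still failing after one minute does.
print(should_send(failed, None, 1,
                  now=failed + timedelta(minutes=1)))   # True
```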