Incident Management

When a monitor fails at 3am, five things have to happen fast, and in the right order:

An incident is opened that records what’s broken and when.
Someone is identified as the on-call responder.
They’re paged through their preferred contact methods.
If they don’t ack, the page escalates to the next person.
No one gets paged for known maintenance.

SiteQwality’s incident management module wires these five together. You can use any of them in isolation (incidents without on-call, manual incidents with no monitors, notification channels with no on-call schedule), but the value compounds when they work together.
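To make the last of the five concrete, here is a minimal sketch of how a maintenance window might suppress a page. The `MaintenanceWindow` class, its fields, and `should_page` are hypothetical names made up for illustration, not SiteQwality's actual schema or API, and the sketch only covers one-time windows.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class MaintenanceWindow:
    """Hypothetical one-time window; not SiteQwality's real data model."""
    monitor_ids: frozenset
    starts_at: datetime
    ends_at: datetime

    def covers(self, monitor_id: str, at: datetime) -> bool:
        # A window only silences the monitors it names, and only while it is open.
        return monitor_id in self.monitor_ids and self.starts_at <= at < self.ends_at


def should_page(monitor_id: str, failed_at: datetime, windows: list) -> bool:
    """Page only if no maintenance window covers this monitor at failure time."""
    return not any(w.covers(monitor_id, failed_at) for w in windows)


window = MaintenanceWindow(
    monitor_ids=frozenset({"api-prod"}),
    starts_at=datetime(2024, 1, 10, 2, 0, tzinfo=timezone.utc),
    ends_at=datetime(2024, 1, 10, 4, 0, tzinfo=timezone.utc),
)
three_am = datetime(2024, 1, 10, 3, 0, tzinfo=timezone.utc)

print(should_page("api-prod", three_am, [window]))  # False: planned work, stay quiet
print(should_page("checkout", three_am, [window]))  # True: not covered by the window
```

Note that the window is scoped to specific monitors, so an unrelated outage during planned work still pages someone.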

The five building blocks

Incidents: Record and timeline of an outage. Auto-opened by monitors, or created manually for non-monitor events.

On-call schedules: Layered rotations across timezones, with overrides for holidays and handoffs.

Escalation policies: Ordered list of who to page next if no one acknowledges, with timeouts.

Notifications: Channels (email, SMS, Slack, Teams, Discord, Telegram, webhooks) and groups, plus per-user contact methods.

Maintenance windows: Silence specific monitors during planned work. One-time or recurring.
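An escalation policy is easiest to picture as an ordered list of steps, each with its own ack timeout. The sketch below is one way to model that. `EscalationStep`, `current_target`, and the timeout values are assumptions chosen for illustration, not SiteQwality's API. Returning `None` once the chain is exhausted is also an assumption; what the product does at that point is not covered here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EscalationStep:
    target: str           # a user or an on-call schedule (hypothetical labels)
    timeout_minutes: int  # how long to wait for an ack before moving on


def current_target(steps: list, minutes_unacked: int) -> Optional[str]:
    """Return who should be paged after `minutes_unacked` without an ack."""
    elapsed = 0
    for step in steps:
        elapsed += step.timeout_minutes
        if minutes_unacked < elapsed:
            return step.target
    return None  # chain exhausted (assumed behavior for this sketch)


policy = [
    EscalationStep("primary on-call", 5),
    EscalationStep("secondary on-call", 10),
    EscalationStep("engineering manager", 15),
]

print(current_target(policy, 0))   # primary on-call
print(current_target(policy, 5))   # secondary on-call
print(current_target(policy, 15))  # engineering manager
print(current_target(policy, 30))  # None
```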

How they fit together

┌──────────────────────┐
│ Monitor flips status │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ Incident opened      │  ←── manual incidents go here too
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ Notification group   │  ── channels: email, Slack, webhooks, etc.
│ (account-level)      │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ Escalation policy    │  ── targets: users + on-call schedules
│ (per incident)       │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ User contact methods │  ── personal: email, SMS, phone-call, Slack DM
│ × notification rules │  ── personal: "high urgency? SMS at 0min, call at 5min"
└──────────────────────┘

The split between notification groups (account-level “where this monitor’s alerts go”) and contact methods + rules (per-user “how I personally want to be reached”) is intentional. It lets the team configure routing once and lets each engineer set their own delivery preferences.
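The per-user half of that split can be sketched as a small set of rules keyed by urgency. This mirrors the "SMS at 0min, call at 5min" example from the diagram; `NotificationRule`, `due_methods`, and the urgency values are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NotificationRule:
    urgency: str        # "high" or "low" (assumed values)
    method: str         # e.g. "sms", "phone-call", "email", "slack-dm"
    delay_minutes: int  # minutes after the page starts


def due_methods(rules: list, urgency: str, minutes_since_page: int) -> list:
    """Contact methods that should have fired by now, in firing order."""
    matching = [
        r for r in rules
        if r.urgency == urgency and r.delay_minutes <= minutes_since_page
    ]
    return [r.method for r in sorted(matching, key=lambda r: r.delay_minutes)]


# One engineer's personal preferences.
my_rules = [
    NotificationRule("high", "phone-call", 5),
    NotificationRule("high", "sms", 0),
    NotificationRule("low", "email", 0),
]

print(due_methods(my_rules, "high", 0))  # ['sms']
print(due_methods(my_rules, "high", 5))  # ['sms', 'phone-call']
print(due_methods(my_rules, "low", 10))  # ['email']
```

Because the rules belong to the user and not to the monitor, changing how one engineer is reached never touches team-level routing.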

When NOT to use this

Don’t open incidents for transient flakes. Set the notification group’s delay_send_after_minutes so a check that fails once and recovers within a minute doesn’t page anyone. See Notification groups.
Don’t escalate everything. Reserve escalation policies for high-urgency incidents. For warnings, a single Slack channel notification is enough.
Don’t put your CTO on every escalation policy. They’ll mute everything within a week. Build escalation chains with clear ownership: primary on-call → secondary → engineering manager.
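The first point is the one teams most often skip, so here is a rough sketch of the intent behind delay_send_after_minutes: a failure only produces a notification if it outlasts the delay. The function below is an assumption about how such a delay behaves, written for illustration. It is not SiteQwality's implementation, and `outlasts_delay` is a made-up name.

```python
from datetime import datetime, timedelta
from typing import Optional


def outlasts_delay(
    failed_at: datetime,
    recovered_at: Optional[datetime],
    delay_send_after_minutes: int,
) -> bool:
    """True if the failure lasts long enough to notify (assumed semantics)."""
    if recovered_at is None:
        # Still failing: the notification goes out once the delay elapses.
        return True
    return recovered_at - failed_at >= timedelta(minutes=delay_send_after_minutes)


failed = datetime(2024, 1, 10, 3, 0, 0)
flake = datetime(2024, 1, 10, 3, 0, 40)  # recovered after 40 seconds
outage = datetime(2024, 1, 10, 3, 7, 0)  # recovered after 7 minutes

print(outlasts_delay(failed, flake, 2))   # False: transient flake, nobody is paged
print(outlasts_delay(failed, outage, 2))  # True: real outage
print(outlasts_delay(failed, None, 2))    # True: still down
```

The trade-off is that every real outage is also reported that many minutes later, so keep the delay short.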
