Incident Management
When a monitor breaks at 3am, five things have to happen — fast and in the right order:
- An incident is opened that records what’s broken and when.
- Someone is identified as the on-call responder.
- They’re paged through their preferred contact methods.
- If they don’t ack, the page escalates to the next person.
- No one gets paged for known maintenance.
SiteQwality’s incident management module wires these five together. You can use any of them in isolation — incidents without on-call, manual incidents with no monitors, notification channels with no on-call schedule — but the value compounds when they work together.
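As a rough illustration of how the five steps chain together, here is a minimal Python sketch. It is not SiteQwality's implementation: `Incident`, `handle_failure`, and every parameter name are hypothetical, and it assumes that a monitor under known maintenance opens no incident at all, which the product may handle differently.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Incident:
    # Hypothetical record: what broke, when, and what happened afterwards.
    monitor: str
    opened_at: datetime
    timeline: list[str] = field(default_factory=list)


def handle_failure(monitor, now, in_maintenance, responders, acked_by):
    """Walk the five steps for one failing monitor.

    responders: on-call chain in escalation order, primary first.
    acked_by: set of responders who would acknowledge a page.
    Returns None when the monitor is under known maintenance (assumption).
    """
    # Known maintenance: no one gets paged.
    if in_maintenance:
        return None
    # Open an incident that records what is broken and when.
    incident = Incident(monitor=monitor, opened_at=now)
    incident.timeline.append(f"opened for {monitor}")
    # Page the on-call responder, then escalate until someone acks.
    for person in responders:
        incident.timeline.append(f"paged {person}")
        if person in acked_by:
            incident.timeline.append(f"acked by {person}")
            break
    return incident
```

Calling `handle_failure("api", datetime.now(timezone.utc), False, ["alice", "bob"], {"bob"})` would page alice first, then escalate to bob, who acknowledges.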
The five building blocks
- Incidents: Record + timeline of an outage. Auto-opened by monitors, or created manually for non-monitor events.
- On-call schedules: Layered rotations across timezones with overrides for holidays and handoffs.
- Escalation policies: Ordered list of who to page next if no one acknowledges, with timeouts.
- Notifications: Channels (email, SMS, Slack, Teams, Discord, Telegram, webhooks) and groups, plus per-user contact methods.
- Maintenance windows: Silence specific monitors during planned work, either one-time or recurring.
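To make "ordered list with timeouts" concrete, the following sketch models an escalation policy as a list of steps and works out who is being paged after a given number of unacknowledged minutes. `EscalationStep`, `current_target`, and the field names are hypothetical, and the sketch assumes a non-empty policy that stays on its last step once every timeout has expired.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    target: str           # a user or an on-call schedule (hypothetical names)
    timeout_minutes: int  # how long to wait for an ack before moving on


def current_target(steps, minutes_unacked):
    """Return the target being paged after `minutes_unacked` minutes without an ack."""
    elapsed = 0
    for step in steps:
        elapsed += step.timeout_minutes
        if minutes_unacked < elapsed:
            return step.target
    # Every timeout has expired: stay on the last step (assumption).
    return steps[-1].target


policy = [
    EscalationStep("primary on-call", 5),
    EscalationStep("secondary", 10),
    EscalationStep("engineering manager", 15),
]
```

With this policy, `current_target(policy, 0)` gives the primary, and `current_target(policy, 5)` moves to the secondary because the primary's five-minute timeout has passed.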
How they fit together
┌──────────────────────┐
│ Monitor flips status │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ Incident opened      │ ←── manual incidents go here too
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ Notification group   │ ── channels: email, Slack, webhooks, etc.
│ (account-level)      │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ Escalation policy    │ ── targets: users + on-call schedules
│ (per incident)       │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ User contact methods │ ── personal: email, SMS, phone-call, Slack DM
│ × notification rules │ ── personal: "high urgency? SMS at 0min, call at 5min"
└──────────────────────┘

The split between notification groups (account-level “where this monitor’s alerts go”) and contact methods + rules (per-user “how I personally want to be reached”) is intentional. It lets the team configure routing once and lets each engineer set their own delivery preferences.
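The per-user side of that split can be pictured as a small rule table. The sketch below encodes the "high urgency? SMS at 0min, call at 5min" example and reports which contact methods should have fired by a given minute. `NotificationRule`, `due_methods`, and the urgency values are assumptions made for illustration, not SiteQwality's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NotificationRule:
    urgency: str        # "high" or "low" (hypothetical values)
    method: str         # one of the user's contact methods, e.g. "sms"
    after_minutes: int  # delay from the moment the user is paged


def due_methods(rules, urgency, minutes_since_page):
    """Contact methods that should have fired by now, in firing order."""
    matching = [
        r for r in rules
        if r.urgency == urgency and r.after_minutes <= minutes_since_page
    ]
    return [r.method for r in sorted(matching, key=lambda r: r.after_minutes)]


rules = [
    NotificationRule("high", "phone-call", 5),
    NotificationRule("high", "sms", 0),
    NotificationRule("low", "email", 0),
]
```

Because the rules belong to the user, changing them does not touch the notification group or the escalation policy that routed the page to that user.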
When NOT to use this
- Don’t open incidents for transient flakes. Set the notification group’s delay_send_after_minutes so a check that fails once and recovers within a minute doesn’t page anyone. See Notification groups.
- Don’t escalate everything. Reserve escalation policies for high-urgency incidents. For warnings, a single Slack channel notification is enough.
- Don’t put your CTO on every escalation policy. They’ll mute everything within a week. Build escalation chains with clear ownership: primary on-call → secondary → engineering manager.
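One way to picture the effect of delay_send_after_minutes on transient flakes is the sketch below. Only the setting name comes from this page. The function `should_send` and the semantics it encodes (hold the alert for the delay, drop it if the check recovered before the delay elapsed) are assumptions, so check Notification groups for the actual behavior.

```python
from datetime import datetime, timedelta


def should_send(failed_at, recovered_at, now, delay_send_after_minutes):
    """Decide whether a delayed alert should go out at `now`.

    recovered_at is None while the check is still failing.
    """
    send_at = failed_at + timedelta(minutes=delay_send_after_minutes)
    if now < send_at:
        return False  # still inside the hold period
    if recovered_at is not None and recovered_at <= send_at:
        return False  # transient flake: recovered before the alert was due
    return True
```

Under these assumptions, a check that fails at 03:00 and recovers 45 seconds later never pages anyone with a two-minute delay, while a check that is still failing at 03:02 does.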
See also
- Maintenance windows: silence monitors during deploys.
- Status pages: publish incidents externally.
- Integrations: connect Slack, Teams, Discord, Telegram, webhooks.