Escalation policies

An escalation policy is the rule for what to do if no one acknowledges an incident. It’s an ordered list of levels; each level fires its targets, waits its timeout, and if no acknowledgement arrives, the next level fires.

You attach an escalation policy to an incident-routing path (typically via a notification group → escalation policy chain). When the incident opens, the policy starts at level 1. Each timeout that passes without an acknowledgement advances the policy to the next level; after the last level, it either restarts at level 1 or stops, depending on repeat_count.

| Field | What it is |
| --- | --- |
| name | Human label. "Primary on-call escalation". |
| repeat_count | After exhausting all levels, how many full passes to run. 0 means stop. |
| levels[] | Ordered list of EscalationLevels. |

Each level:

| Field | What it is |
| --- | --- |
| level_number | Sort order. Auto-assigned to the next integer if omitted. |
| timeout_minutes | How long to wait for an ack before moving on. |
| target_user_ids | Page these specific users (their notification rules + contact methods kick in). |
| target_schedule_ids | Page whoever’s on-call in these schedules right now. |

Both target arrays can be combined on a single level: “page Alice and whoever’s on the platform schedule, wait 5 minutes, then move on.”
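As a rough illustration of how these fields fit together, here is a minimal Python sketch of the data model. The field names come from the tables above; the class layout, the add_level helper, the string ID format, and the example IDs (user_alice, sched_platform, sched_backup) are assumptions made for illustration, not the product's actual API or client library.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EscalationLevel:
    timeout_minutes: int
    target_user_ids: list[str] = field(default_factory=list)
    target_schedule_ids: list[str] = field(default_factory=list)
    level_number: Optional[int] = None  # omitted -> assigned on add


@dataclass
class EscalationPolicy:
    name: str
    repeat_count: int = 0
    levels: list[EscalationLevel] = field(default_factory=list)

    def add_level(self, level: EscalationLevel) -> None:
        # Local stand-in for the "next integer if omitted" rule.
        if level.level_number is None:
            highest = max((l.level_number or 0 for l in self.levels), default=0)
            level.level_number = highest + 1
        self.levels.append(level)
        self.levels.sort(key=lambda l: l.level_number)


policy = EscalationPolicy(name="Primary on-call escalation", repeat_count=2)

# Level 1 combines both target arrays: Alice plus the platform schedule.
policy.add_level(EscalationLevel(
    timeout_minutes=5,
    target_user_ids=["user_alice"],          # hypothetical ID
    target_schedule_ids=["sched_platform"],  # hypothetical ID
))
policy.add_level(EscalationLevel(
    timeout_minutes=10,
    target_schedule_ids=["sched_backup"],    # hypothetical ID
))
```

Both levels were added without a level_number, so the sketch numbers them 1 and 2 in insertion order.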

Incident opens
         │
         ▼
┌─ Level 1 ──────────────────┐
│ targets: [on_call_primary] │
│ timeout: 5 min             │
└────────┬───────────────────┘
         │ no ack within 5 min
         ▼
┌─ Level 2 ──────────────────┐
│ targets: [on_call_backup]  │
│ timeout: 10 min            │
└────────┬───────────────────┘
         │ no ack within 10 min
         ▼
┌─ Level 3 ──────────────────┐
│ targets: [eng_manager]     │
│ timeout: 15 min            │
└────────┬───────────────────┘
         │ no ack within 15 min
         ▼
repeat or stop (per repeat_count)

If anyone acks at any level, the policy stops dead: no further levels fire. The acked incident stays open while the responder works the fix.
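To make the timing in the diagram concrete, the following sketch walks a single pass through the levels and reports which ones fire before an ack arrives. The function name levels_fired and the choice to count an ack that lands exactly on the timeout as "within" it are assumptions for illustration; the real engine may treat that boundary differently.

```python
from typing import Optional


def levels_fired(
    timeouts_minutes: list[int],
    ack_at_minute: Optional[float],
) -> list[tuple[int, int]]:
    """Return (level_number, fired_at_minute) for one pass through the levels.

    ack_at_minute is minutes after the incident opened, or None if nobody acks.
    """
    fired = []
    clock = 0
    for level_number, timeout in enumerate(timeouts_minutes, start=1):
        fired.append((level_number, clock))  # this level pages its targets
        acked_in_window = (
            ack_at_minute is not None and ack_at_minute <= clock + timeout
        )
        if acked_in_window:
            break  # an ack stops the policy; later levels never fire
        clock += timeout  # timeout elapsed with no ack, move on
    return fired


timeouts = [5, 10, 15]  # the three levels from the diagram

print(levels_fired(timeouts, ack_at_minute=3))     # acked during level 1
print(levels_fired(timeouts, ack_at_minute=7))     # acked during level 2
print(levels_fired(timeouts, ack_at_minute=None))  # never acked
```

With the diagram's timeouts, level 2 fires at minute 5 and level 3 at minute 15, so an ack at minute 7 means the engineering manager is never paged.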

Notification groups and escalation policies are two different mechanisms with overlapping vibes:

  • Notification groups are the raw fanout: “this monitor’s alerts go to these channels.” Channel-level. No retry, no escalation.
  • Escalation policies are the smart routing: “if no one acks, move on.” User and schedule level, hooking into per-user notification rules.

A monitor’s alert always fires its notification group. The group can include channels that further trigger an escalation policy. In practice: the group sends a Slack ping for visibility; the escalation policy is what actually pages the on-call.

repeat_count: 0 means “after exhausting all levels, give up.” That’s almost never what you want. Set it to at least 1 or 2 so the policy re-pages from level 1 instead of going silent after a single pass.

After each full pass through the levels, the policy starts over at level 1 for as long as repeats remain. Setting repeat_count: 999 effectively means “page forever until acked.”
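A quick back-of-the-envelope calculation shows why 999 is "effectively forever." The sketch below assumes repeat_count counts extra passes after the first one (so total passes = 1 + repeat_count), which is one reading of the field description above; the function name max_paging_minutes is made up for this example.

```python
def max_paging_minutes(timeouts_minutes: list[int], repeat_count: int) -> int:
    """Upper bound on how long an unacked incident keeps paging.

    Assumes repeat_count is the number of additional full passes
    after the initial pass through all levels.
    """
    passes = 1 + repeat_count
    return passes * sum(timeouts_minutes)


timeouts = [5, 10, 15]  # one pass through the example levels takes 30 minutes

print(max_paging_minutes(timeouts, repeat_count=0))    # 30
print(max_paging_minutes(timeouts, repeat_count=2))    # 90
print(max_paging_minutes(timeouts, repeat_count=999))  # 30000
```

Under that assumption, 30,000 minutes is roughly 20 days of continuous paging, far longer than any incident should stay unacknowledged.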

Reasons to create more than one escalation policy:

  • One per service if your on-call rotations differ by service ownership.
  • One per severity if critical should escalate faster than major.
  • One per environment if prod should hit on-call but staging should only ping a Slack channel.