Escalation policies

An escalation policy is the rule for what to do if no one acknowledges an incident. It’s an ordered list of levels; each level fires its targets, waits its timeout, and if no acknowledgement arrives, the next level fires.

What starts an escalation

An escalation runs when an incident that has a service_id opens, and that service has an escalation_policy_id set. That is the only wiring: the policy is attached to the service, not to the monitor and not to the notification group.

This applies equally to incidents a monitor opened automatically and to incidents you created by hand. An incident with no service_id, or whose service has no policy, never escalates.
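
The trigger condition can be sketched as a single predicate. This is a minimal illustration in Python, not the product's implementation: `Incident`, `Service`, and `should_escalate` are hypothetical names, and only `service_id` and `escalation_policy_id` come from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Service:
    id: str
    escalation_policy_id: Optional[str] = None  # the policy hangs off the service


@dataclass
class Incident:
    id: str
    service_id: Optional[str] = None  # may be unset, whoever opened the incident


def should_escalate(incident: Incident, services: dict[str, Service]) -> bool:
    """Return True only when the incident's service has a policy attached."""
    if incident.service_id is None:
        return False
    service = services.get(incident.service_id)
    return service is not None and service.escalation_policy_id is not None
```

Note that nothing in the check looks at how the incident was opened, which matches the rule that monitor-opened and hand-created incidents behave the same.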

Anatomy

Field	What it is
`name`	Human label. `"Primary on-call escalation"`.
`repeat_count`	How many full passes through the levels to run. Defaults to `3`.
`levels[]`	Ordered list of `EscalationLevel`s.

Each level:

Field	What it is
`level_number`	Sort order. Auto-assigned to the next integer if omitted.
`timeout_minutes`	How long to wait for an ack before moving on.
`target_user_ids`	Page these specific users (their notification rules + contact methods kick in).
`target_schedule_ids`	Page whoever’s on-call in these schedules right now.

Both target arrays can be combined on a single level: “page Alice and whoever’s on the platform schedule, wait 5 minutes, then move on.”
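
As a rough model of the fields above, the sketch below builds a two-level policy in Python. The field names and the default `repeat_count` of `3` come from the tables; the dataclasses, the `add_level` helper, and the reading of "next integer" as one more than the highest existing `level_number` are assumptions made for illustration, as are the IDs `alice`, `platform`, and `backup`.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EscalationLevel:
    timeout_minutes: int
    target_user_ids: list[str] = field(default_factory=list)
    target_schedule_ids: list[str] = field(default_factory=list)
    level_number: Optional[int] = None  # assigned by add_level when omitted


@dataclass
class EscalationPolicy:
    name: str
    repeat_count: int = 3  # documented default
    levels: list[EscalationLevel] = field(default_factory=list)

    def add_level(self, level: EscalationLevel) -> None:
        # Assumption: "next integer" = highest existing level_number + 1.
        if level.level_number is None:
            highest = max((lvl.level_number or 0 for lvl in self.levels), default=0)
            level.level_number = highest + 1
        self.levels.append(level)
        self.levels.sort(key=lambda lvl: lvl.level_number or 0)


policy = EscalationPolicy(name="Primary on-call escalation")
# Level 1 combines a specific user with whoever is on call for a schedule.
policy.add_level(EscalationLevel(timeout_minutes=5,
                                 target_user_ids=["alice"],
                                 target_schedule_ids=["platform"]))
policy.add_level(EscalationLevel(timeout_minutes=10,
                                 target_schedule_ids=["backup"]))
```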

Lifecycle

Incident opens
     │
     ▼
┌─ Level 1 ──────────────────┐
│ targets: [on_call_primary] │
│ timeout: 5 min             │
└────────┬───────────────────┘
         │
   no ack within 5 min
         │
         ▼
┌─ Level 2 ──────────────────┐
│ targets: [on_call_backup]  │
│ timeout: 10 min            │
└────────┬───────────────────┘
         │
   no ack within 10 min
         │
         ▼
┌─ Level 3 ──────────────────┐
│ targets: [eng_manager]     │
│ timeout: 15 min            │
└────────┬───────────────────┘
         │
   no ack within 15 min
         │
         ▼
   repeat or stop (per repeat_count)

Each level does three things in order: resolve its targets (on-call schedules resolve at firing time, so handoffs work transparently), page each resolved user through their own notification rules, and schedule a one-shot timeout for timeout_minutes later. When that timeout fires, the policy advances.

A level with no resolvable targets still schedules its timeout and still advances. It just pages nobody.
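
The three steps a level performs can be sketched as one function. This is an illustrative Python outline under stated assumptions, not the actual implementation: `fire_level` and the three callbacks (`on_call_now`, `page_user`, `schedule_timeout`) are hypothetical, and de-duplicating users who appear both directly and through a schedule is a guess rather than documented behavior.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Level:
    timeout_minutes: int
    target_user_ids: list[str] = field(default_factory=list)
    target_schedule_ids: list[str] = field(default_factory=list)


def fire_level(
    level: Level,
    on_call_now: Callable[[str], list[str]],
    page_user: Callable[[str], None],
    schedule_timeout: Callable[[int], None],
) -> list[str]:
    """Run the three steps for one level and return the users paged."""
    # 1. Resolve targets. Schedules are looked up now, at firing time.
    users = list(level.target_user_ids)
    for schedule_id in level.target_schedule_ids:
        users.extend(on_call_now(schedule_id))
    users = list(dict.fromkeys(users))  # assumption: de-duplicate, keep order

    # 2. Page each resolved user; their own notification rules apply downstream.
    for user_id in users:
        page_user(user_id)

    # 3. Always schedule the timeout, even when nobody was paged.
    schedule_timeout(level.timeout_minutes)
    return users
```

Because step 3 is unconditional, a level whose schedules resolve to nobody still hands off to the next level once its timeout elapses.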

Acknowledging stops it

Acknowledging or resolving the incident cancels the escalation: the escalation state moves to acknowledged, and any pending timeout and per-user delay schedules are deleted. Only the escalation is cancelled; the incident itself remains, and the responder works on the fix.

Every route into acknowledgement does this, so it does not matter how the responder answers:

POST /incident/{id}/acknowledge and POST /incident/{id}/resolve
the one-click links in email and SMS
the Acknowledge and Resolve buttons in Slack
automatic recovery, when the monitor that opened the incident comes back
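
One way to picture "every route does this" is a single cancel function that all entry points call. The Python sketch below is an assumption-laden illustration: `Escalation`, `pending_schedules`, and `cancel_escalation` are hypothetical names. The text above does not say what happens when an already exhausted escalation is acknowledged, so this sketch simply leaves non-active escalations untouched.

```python
from dataclasses import dataclass, field


@dataclass
class Escalation:
    state: str = "active"
    # Hypothetical IDs of pending timeout and per-user delay schedules.
    pending_schedules: set[str] = field(default_factory=set)


def cancel_escalation(escalation: Escalation) -> bool:
    """Stop paging for an active escalation. Returns True if it changed state."""
    if escalation.state != "active":
        return False  # assumption: repeated or late calls are no-ops
    escalation.state = "acknowledged"
    escalation.pending_schedules.clear()  # delete timeout and delay schedules
    return True
```

Under this shape, the API endpoints, the one-click links, the Slack buttons, and automatic recovery would all end in the same call, which is why the route the responder takes makes no difference.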

Targets vs notification groups

Two different mechanisms with overlapping vibes:

Notification groups are the raw fanout: “this monitor’s alerts go to these channels.” Channel-level. No retry, no escalation.
Escalation policies are the smart routing: “if no one acks, move on.” User and schedule level, hooking into per-user notification rules.

A monitor’s alert always fires its notification group. Escalation is wired separately: it runs only when the resulting incident has a service with an escalation policy attached, never through the group itself. In practice: the group sends a Slack ping for visibility; the escalation policy is what actually pages the on-call.

Repeat cycles

repeat_count is how many full passes through the levels the policy makes, and it defaults to 3.

When the last level’s timeout fires without an acknowledgement, one pass is complete. If the completed passes have reached repeat_count, the escalation becomes exhausted and stops paging; otherwise the policy returns to level 1.

`repeat_count`	Behavior
`3` (default)	Three full passes through every level, then stop.
`1`	One pass. The last level’s timeout exhausts the policy.
`0`	Degenerate. Level 1 pages, and the first timeout exhausts the policy immediately. It never reaches level 2.
large	Cycles for that many passes. There is no enforced maximum.
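
The table can be reproduced with a small simulation of an escalation that nobody ever acknowledges. This is a sketch of one advance rule that is consistent with the rows above, not the product's code: `simulate_unacknowledged` is a hypothetical helper, and checking the pass count at every timeout (which is what makes `repeat_count` of `0` stop after level 1) is an inference from the table.

```python
def simulate_unacknowledged(level_count: int, repeat_count: int) -> list[int]:
    """Return the level numbers that page, in order, if nobody ever acks."""
    if level_count < 1:
        raise ValueError("a policy needs at least one level")
    paged = [1]  # level 1 fires as soon as the escalation starts
    current = 1
    completed_passes = 0
    while True:
        # The current level's timeout fires here with no acknowledgement.
        if completed_passes >= repeat_count:
            return paged  # exhausted; covers the degenerate repeat_count == 0
        if current == level_count:
            completed_passes += 1  # the last level timing out ends a pass
            if completed_passes >= repeat_count:
                return paged  # exhausted after the final pass
            current = 1  # wrap around to level 1
        else:
            current += 1
        paged.append(current)


print(simulate_unacknowledged(3, 1))  # [1, 2, 3]
print(simulate_unacknowledged(3, 0))  # [1]
print(simulate_unacknowledged(2, 3))  # [1, 2, 1, 2, 1, 2]
```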

Escalation state

Each incident carries one escalation state, visible on the incident:

State	Meaning
`active`	Cycling through levels.
`acknowledged`	Someone acked or resolved. Paging stopped, pending schedules deleted.
`exhausted`	Ran out of repeat cycles without an acknowledgement. Level cycling has stopped.

When to use multiple policies

One per service if your on-call rotations differ by service ownership.
One per severity if critical should escalate faster than major.
One per environment if prod should hit on-call but staging should only ping a Slack channel.