Incidents

An incident is the durable record of “something is broken.” Every page, every escalation, every status-page update hangs off an incident. They’re created two ways:

  1. Auto — when a monitor flips from healthy to failing, SiteQwality opens an incident with auto_created: true.
  2. Manual — when you create one in the dashboard or via POST /incident. Use this for non-monitor events (“partial Stripe outage breaking webhooks,” “a cron job took down the queue”).

Either way the lifecycle is the same.

| Field | What it is |
| --- | --- |
| title | Short, scannable label. "Elevated 5xx errors on /api/checkout". Shown in alerts. |
| severity | minor, major, or critical. Drives display priority on status pages and dashboards. |
| status | investigating → identified → monitoring → resolved. Forward-only in practice; see notes. |
| responder_status | triggered (no one’s looking), acknowledged (someone’s on it), or resolved. |
| acknowledged_at / acknowledged_by | Set when a responder acknowledges. Stops escalation. |
| affected_http_job_ids | Which HTTP monitors this incident is about — drives which status-page components get a banner. |
| service_id | Optionally tags the incident to a service (a logical grouping of monitors). |
| status_page_id | If set, the incident is published to that status page. |
| updates[] | Chronological list of IncidentUpdates — status transitions with a written message. |
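To make the shape concrete, here is a minimal local model of an incident record using the field names from the table above — a hypothetical sketch, not an official SiteQwality client:

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical local model of an incident record. Field names mirror
# the docs table; defaults match a freshly opened incident.
@dataclass
class Incident:
    title: str
    severity: str = "minor"              # minor | major | critical
    status: str = "investigating"        # investigating -> identified -> monitoring -> resolved
    responder_status: str = "triggered"  # triggered | acknowledged | resolved
    acknowledged_at: Optional[str] = None
    acknowledged_by: Optional[str] = None
    affected_http_job_ids: list = field(default_factory=list)
    service_id: Optional[str] = None
    status_page_id: Optional[str] = None
    updates: list = field(default_factory=list)
    auto_created: bool = False

inc = Incident(
    title="Elevated 5xx errors on /api/checkout",
    severity="major",
    affected_http_job_ids=["job_123"],
    auto_created=True,
)
```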

Incidents carry two parallel state machines, which often confuse new users:

  • status is the external / customer-facing progress: “we’re still investigating,” “we know the cause,” “fix is rolled out, we’re watching.” It’s what shows on the status page.
  • responder_status is the internal / responder state: “no one’s on this,” “someone has it,” “we’re done.” It’s what drives escalation and ack tracking.

A status page incident can be status: monitoring while still responder_status: acknowledged — meaning the customer-facing message is “we’re keeping an eye on it” but a human is still actively babysitting.
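The key point is that each field advances through its own ordering, independently of the other. A minimal sketch (the helper and state lists are illustrative, not SiteQwality's API):

```python
# Each state machine has its own ordering; neither constrains the other.
STATUS_ORDER = ["investigating", "identified", "monitoring", "resolved"]
RESPONDER_ORDER = ["triggered", "acknowledged", "resolved"]

def advance(current: str, target: str, order: list) -> str:
    """Allow only forward moves within one machine's ordering."""
    if order.index(target) < order.index(current):
        raise ValueError(f"backward transition: {current} -> {target}")
    return target

# Customer-facing message moves to "monitoring"...
status = advance("identified", "monitoring", STATUS_ORDER)
# ...while the responder is still only "acknowledged": both are valid at once.
responder_status = advance("triggered", "acknowledged", RESPONDER_ORDER)
```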

| Severity | Use for |
| --- | --- |
| minor | Single non-critical endpoint flapping; degraded performance below SLA but still functional. |
| major | Customer-facing feature broken for a subset of users; one region down with traffic still served from another. |
| critical | Full outage; data loss; security incident. Wakes everyone up. |

The severity field is informational — it doesn’t change escalation behavior. To change who gets paged, use a different notification group or escalation policy. Severity drives sort order, color coding, and any filters you set up on the alerting fabric.
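Because severity only affects presentation, prioritizing a list of incidents is a plain client-side sort. A sketch with assumed sample data and an assumed critical-first ranking:

```python
# Severity drives sort order and color, not escalation.
# Assumed convention: critical sorts first.
SEVERITY_RANK = {"critical": 0, "major": 1, "minor": 2}

incidents = [
    {"title": "Slow search", "severity": "minor"},
    {"title": "Checkout down", "severity": "critical"},
    {"title": "EU region degraded", "severity": "major"},
]

by_priority = sorted(incidents, key=lambda i: SEVERITY_RANK[i["severity"]])
```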

Lifecycle: a typical auto-created incident

  1. HTTP check prod-api flips from success to failure at 03:14:22.
  2. Incident opens automatically — auto_created: true, status: investigating, responder_status: triggered, affected_http_job_ids: [<job_id>].
  3. The check’s notification group fires; channels get the page; targeted users’ contact methods fire per their notification rules.
  4. The on-call engineer receives an SMS at 03:14:30. They open the dashboard and click Acknowledge at 03:15:10.
  5. responder_status flips to acknowledged. Escalation stops.
  6. They post an incident update at 03:18 — status: identified, message "Database failover stuck. Manually promoting standby."
  7. The fix lands at 03:24. The check goes green on the next tick.
  8. SiteQwality auto-resolves the incident — status: resolved, responder_status: resolved, resolved_at set.

You can override any of these transitions manually if needed: the auto-resolution will be skipped if you’ve already moved the incident to resolved yourself.
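The timeline above, including the rule that auto-resolution is skipped after a manual resolve, can be sketched like this (illustrative helpers, not the real engine):

```python
# Illustrative walk through an auto-created incident's lifecycle.
incident = {
    "auto_created": True,
    "status": "investigating",
    "responder_status": "triggered",
    "resolved_at": None,
}

def acknowledge(inc, who, when):
    inc["responder_status"] = "acknowledged"  # escalation stops here
    inc["acknowledged_by"] = who
    inc["acknowledged_at"] = when

def auto_resolve(inc, when):
    if inc["status"] == "resolved":           # manual resolve wins: no-op
        return
    inc["status"] = "resolved"
    inc["responder_status"] = "resolved"
    inc["resolved_at"] = when

acknowledge(incident, "oncall@example.com", "03:15:10")
incident["status"] = "identified"             # posted as an incident update
auto_resolve(incident, "03:24:40")            # check went green on next tick
```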

Monitors can’t see every outage. Create a manual incident when:

  • A third-party service breaks something for your customers (Stripe, AWS, your CDN). The monitors don’t know — but you do.
  • A bug in production that doesn’t manifest as a monitor failure (UI broken, auth flow broken for a subset of users).
  • Coordination of a complex outage across multiple teams. One canonical incident with a clear timeline beats six Slack threads.

For these, use POST /incident (standalone) or POST /status_page/{id}/incident (auto-publishes to a specific page).
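For illustration, here is roughly what a manual-creation request body might look like. The endpoint path comes from the docs; the base URL, auth scheme, and field values are assumptions, and the request is only built, never sent:

```python
import json
import urllib.request

# Assumed base URL and bearer-token auth; field names are from the
# incident table, values are illustrative.
payload = {
    "title": "Partial Stripe outage breaking webhooks",
    "severity": "major",
    "status": "investigating",
    "status_page_id": "sp_123",  # optional: publish to a status page
}

req = urllib.request.Request(
    "https://api.example.com/incident",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer <token>",
    },
    method="POST",
)
# urllib.request.urlopen(req) would send it; omitted here.
```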

Acknowledging an incident is not resolving it. Ack stops the page (so escalation won’t keep paging), but the incident stays open until someone explicitly resolves it. This is intentional — the responder claims ownership, then has the time to actually fix the thing without their phone going off.