Understand the grace period
If your nightly backup ran at 02:00 yesterday and is currently running at 02:05 today, is it “late”? SiteQwality has to draw the line somewhere — that’s the grace period.
The current behavior
A cron check is considered failing when:
now - last_received_at > check_interval_seconds × 2
That is, the implicit grace period is one full check interval. A daily job (interval = 86400s) doesn’t alert until 48 hours have passed without a ping. A 5-minute job alerts after 10 minutes of silence.
This is conservative on purpose: most jobs vary by ±10% in runtime, and adding paging stress for “you’re 2 minutes late” is more harm than help.
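The failing condition can be written as a small function. This is a minimal sketch of the rule as stated above, not SiteQwality's actual implementation; the function name `is_failing` and the constant `GRACE_MULTIPLIER` are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

GRACE_MULTIPLIER = 2  # from the stated rule: silence must exceed 2x the interval


def is_failing(now: datetime, last_received_at: datetime, check_interval_seconds: int) -> bool:
    """Return True when the silence since the last ping exceeds 2x the interval."""
    silence_seconds = (now - last_received_at).total_seconds()
    return silence_seconds > check_interval_seconds * GRACE_MULTIPLIER


last_ping = datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)

# Daily job: 47 hours of silence is still inside the window, 49 hours is not.
print(is_failing(last_ping + timedelta(hours=47), last_ping, 86400))  # False
print(is_failing(last_ping + timedelta(hours=49), last_ping, 86400))  # True

# 5-minute job: 11 minutes of silence crosses the 10-minute line.
print(is_failing(last_ping + timedelta(minutes=11), last_ping, 300))  # True
```

Note that the comparison is strict, so a check is not failing at exactly 2× the interval, only once the silence goes past it.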
How to tune
Job with very tight schedule
If your job must finish within 5 minutes of its schedule:
- Don’t lower check_interval_seconds below the actual interval (that creates false positives).
- Instead, alert on runtime metrics: emit a metric for job duration and alert when it crosses your threshold. The cron check stays as the catch-all for “didn’t run at all.”
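One way to emit a duration metric is to wrap the job in a timer. The sketch below assumes you already have some metrics client; the `emit` callback, the metric name `job.duration_seconds.nightly_backup`, and the 300-second threshold are all hypothetical stand-ins, and in practice the threshold comparison would usually live in your alerting system rather than in the job.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def timed_job(name: str, emit: Callable[[str, float], None]) -> Iterator[None]:
    """Measure the wall-clock duration of the wrapped block and emit it."""
    start = time.monotonic()
    try:
        yield
    finally:
        # Emit even when the job raises, so slow failures are visible too.
        emit(f"job.duration_seconds.{name}", time.monotonic() - start)


def exceeds_threshold(duration_seconds: float, threshold_seconds: float = 300.0) -> bool:
    """Alert condition: the job ran longer than the allowed time."""
    return duration_seconds > threshold_seconds


# Stand-in for a metrics client: record emitted values in a dict.
recorded: dict[str, float] = {}

with timed_job("nightly_backup", recorded.__setitem__):
    time.sleep(0.01)  # stand-in for the real job

duration = recorded["job.duration_seconds.nightly_backup"]
print(exceeds_threshold(duration))  # False
```

The cron check is left untouched here: it still only answers whether the job ran at all.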
Job with very variable runtime
If your job sometimes takes 30 minutes and sometimes 6 hours:
- Set check_interval_seconds based on the schedule between runs, not the runtime. A 6-hour job that runs once a day still has a 24-hour interval.
- Make sure your job pings at the end, not the start.
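Pinging at the end can be sketched as a thin wrapper around the job. The ping URL below is a placeholder, and the assumption that a ping is a plain HTTP GET is mine; use whatever URL and method your cron check actually specifies.

```python
import urllib.request
from typing import Callable

# Hypothetical ping URL; substitute the one shown for your cron check.
PING_URL = "https://example.invalid/ping/your-check-id"


def http_ping(url: str) -> None:
    """Send a simple GET request as the ping (assumed ping mechanism)."""
    with urllib.request.urlopen(url, timeout=10):
        pass


def run_then_ping(job: Callable[[], None], ping: Callable[[str], None] = http_ping) -> None:
    """Run the job to completion, then ping. If the job raises, no ping is sent."""
    job()           # may take 30 minutes or 6 hours; the check does not care
    ping(PING_URL)  # reached only after the job has finished


# Demonstration with a fake ping so nothing touches the network.
events: list[str] = []
run_then_ping(lambda: events.append("job finished"), lambda url: events.append("pinged"))
print(events)  # ['job finished', 'pinged']
```

Because the ping comes last, a long run simply delays the ping within the same interval instead of marking the run as done before it has started its real work.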
Job that runs every minute
For high-frequency jobs (check_interval_seconds under 5 minutes), the 2× grace can be too tight if the job ever has a hiccup. Consider:
- Bumping the interval to 5 or 10 minutes and accepting some staleness in the alert.
- Wrapping the job’s pings in retry logic so a single network blip doesn’t trigger the check.
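The retry idea from the last bullet might look like the following. This is a sketch under assumptions: `send` is whatever function performs your ping, and the attempt count and delays are illustrative values, not recommendations from SiteQwality.

```python
import time
from typing import Callable


def ping_with_retry(
    send: Callable[[], None],
    attempts: int = 3,
    base_delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try the ping up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            send()
            return True
        except OSError:
            # Network errors such as urllib.error.URLError and timeouts derive from OSError.
            if attempt < attempts - 1:
                sleep(base_delay_seconds * 2 ** attempt)
    return False


# Demonstration: a sender that fails twice, then succeeds.
state = {"failures_left": 2}


def flaky_send() -> None:
    if state["failures_left"] > 0:
        state["failures_left"] -= 1
        raise ConnectionError("simulated network blip")


delays: list[float] = []
print(ping_with_retry(flaky_send, sleep=delays.append))  # True
print(delays)  # [1.0, 2.0]
```

For a job that runs every minute, keep the total retry time well below the check interval so that a retried ping still lands inside the grace window.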
What about jobs that fail mid-run
Cron checks don’t know whether your job succeeded, only whether it pinged. If the job crashes after starting but before pinging, no ping arrives for that run, and the check fails once the silence exceeds the 2× threshold.
For finer-grained “did the job succeed?” tracking:
- Have the job emit a metric, such as job.runs.completed{name="nightly_backup", status="success"}, and alert on the failure rate.
- Or have the job ping a different check on success vs. failure (you’ll need two cron checks but the signal is precise).
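The two-check option could be wired up roughly as below. Both ping URLs are placeholders, the `ping` callback stands in for your real ping function, and how you configure alerting on each of the two checks is left to your setup.

```python
from typing import Callable

# Hypothetical ping URLs for two separate cron checks.
SUCCESS_PING_URL = "https://example.invalid/ping/nightly-backup-success"
FAILURE_PING_URL = "https://example.invalid/ping/nightly-backup-failure"


def run_and_report(job: Callable[[], None], ping: Callable[[str], None]) -> bool:
    """Run the job, then ping the check that matches the outcome."""
    try:
        job()
    except Exception:
        ping(FAILURE_PING_URL)
        return False
    ping(SUCCESS_PING_URL)
    return True


def failing_job() -> None:
    raise RuntimeError("backup target unreachable")


# Demonstration with a fake ping that records which URL was hit.
pinged: list[str] = []
print(run_and_report(lambda: None, pinged.append))  # True
print(run_and_report(failing_job, pinged.append))   # False
print(pinged == [SUCCESS_PING_URL, FAILURE_PING_URL])  # True
```

A job that dies without raising (for example, a killed process) pings neither check, so the success check going silent remains the catch-all.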