The Art of On-Call Rotations
On-call does not have to be miserable. I have been on both sides — rotations that burned people out in weeks, and rotations where engineers actually volunteered for extra shifts. The difference is never about tooling. It is about culture, process, and respect for people's time. Here is what I have learned building on-call programs across three different organizations.
Building a Healthy On-Call Culture
The single most important rule: if you build it, you own it. The team that writes the code is the team that gets paged when it breaks. This eliminates the toxic pattern where a separate "ops team" absorbs all the pain of poorly written software while the developers who caused the problem sleep soundly.
But ownership without support is just punishment. Here is what healthy on-call looks like in practice:
- Rotations of at least five or six people, so no one is on-call more than one week per month.
- A primary and secondary on-call. The secondary is a safety net, not a co-responder.
- Compensatory time off after particularly rough shifts. If you got paged 4 times overnight, take a half day the next morning. No questions asked.
- On-call load is tracked as a team metric. If one team is getting paged 10x more than others, that is a signal to invest in reliability work, not to hire more people.
We made on-call metrics part of our quarterly planning. Each team has an "interrupt budget" — if pages exceed the budget, reliability work gets prioritized over feature work in the next sprint. This creates a direct incentive to fix recurring issues rather than just acknowledge and move on.
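As a rough illustration of the interrupt budget check, here is a minimal sketch. The `TeamPages` structure, the team names, and the budget numbers are all made up for the example; the only rule taken from the text above is "pages over budget means reliability work comes first."

```python
from dataclasses import dataclass


@dataclass
class TeamPages:
    team: str
    pages: int   # pages received during the planning period
    budget: int  # hypothetical interrupt budget for the same period


def over_budget(teams: list[TeamPages]) -> list[str]:
    """Return the teams whose page count exceeded their interrupt budget."""
    return [t.team for t in teams if t.pages > t.budget]


# Illustrative numbers only.
quarter = [
    TeamPages("payments", pages=42, budget=30),
    TeamPages("search", pages=11, budget=30),
]

# Teams listed here would prioritize reliability work in the next sprint.
print(over_budget(quarter))
```

A team that lands exactly on its budget is treated as within budget here; whether to use a strict or inclusive comparison is a policy choice.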
Runbook Templates That Actually Get Used
Every alert must have a linked runbook. No exceptions. An alert without a runbook is just a notification that something is wrong, with no guidance on what to do about it. Here is the template we use:
## Alert: [service-name] High Error Rate
### What is happening
The 5xx error rate for [service] has exceeded 5% for 5 minutes.
### Impact
Users may see failed requests on [feature]. Affected endpoints: /api/v1/orders, /api/v1/payments
### First response (< 5 minutes)
1. Check Grafana dashboard: [link]
2. Check recent deployments: `kubectl rollout history deployment/[service] -n production`
3. Check dependent services: [service-a], [service-b] status pages
### Common causes
- **Recent deployment**: Rollback with `kubectl rollout undo deployment/[service] -n production`
- **Database connection pool exhaustion**: Restart pods with `kubectl rollout restart deployment/[service] -n production`
- **Upstream dependency failure**: Check [dependency] status. If down, enable circuit breaker: [link]
### Escalation
If unresolved after 15 minutes, escalate to #incident-response and page the secondary on-call.
The key is specificity. A runbook that says "investigate and fix" is useless at 3 AM when your cognitive function is at 50%. Good runbooks give you exact commands to run and exact links to click.
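The "every alert must have a linked runbook" rule is easy to enforce mechanically. The sketch below assumes alert definitions are available as dictionaries with a `name` and an optional `runbook_url` field. Both field names and the example URL are placeholders, not the schema of any particular alerting tool.

```python
def alerts_missing_runbook(alerts: list[dict]) -> list[str]:
    """Return the names of alerts that have no non-empty runbook link."""
    missing = []
    for alert in alerts:
        # Treat a missing key, None, or a whitespace-only string as "no runbook".
        url = (alert.get("runbook_url") or "").strip()
        if not url:
            missing.append(alert["name"])
    return missing


# Hypothetical alert definitions for illustration.
alerts = [
    {"name": "orders-high-error-rate",
     "runbook_url": "https://example.com/runbooks/orders-errors"},
    {"name": "payments-latency", "runbook_url": ""},
    {"name": "search-queue-depth"},
]

print(alerts_missing_runbook(alerts))
```

Run as a CI step against the alert definitions, a non-empty result could fail the build, so an alert without a runbook never reaches production.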
Escalation Policies That Make Sense
Our escalation chain has three tiers with clear timing:
- 0-5 minutes: Primary on-call is paged via PagerDuty (push notification + phone call).
- 5-15 minutes: If no acknowledgment, secondary on-call is paged.
- 15-30 minutes: If still unacknowledged or if the incident is P1 severity, the engineering manager is paged and an incident channel is auto-created in Slack.
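The three tiers above can be written down as a small decision function. This is only a sketch of the timing logic, not how PagerDuty is configured; the function name, the role labels, and the idea of passing the acknowledgment time in minutes are assumptions made for the example.

```python
from typing import Optional


def paged_roles(now_min: float, acked_at_min: Optional[float], severity: str) -> list[str]:
    """Return the roles that should have been paged by `now_min` minutes
    into the incident.

    acked_at_min is the minute the page was acknowledged, or None if it
    has not been acknowledged yet.
    """

    def unacked_at(t: float) -> bool:
        return acked_at_min is None or acked_at_min > t

    roles = ["primary"]  # tier 1: paged immediately

    # Tier 2: no acknowledgment by the 5-minute mark.
    if now_min >= 5 and unacked_at(5):
        roles.append("secondary")

    # Tier 3: still unacknowledged at 15 minutes, or the incident is a P1.
    if now_min >= 15 and (unacked_at(15) or severity == "P1"):
        roles.append("engineering_manager")

    return roles


print(paged_roles(10, None, "P2"))  # unacknowledged P2 at 10 minutes
print(paged_roles(20, 2, "P1"))     # P1 acknowledged after 2 minutes
```

Note that an acknowledged P1 still reaches the engineering manager at the 15-minute mark, which matches the "or if the incident is P1 severity" clause in the third tier.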
We classify severity levels clearly: P1 means user-facing functionality is down for more than 10% of users. P2 means degraded performance or partial outage. P3 means an internal system is unhealthy but user impact is minimal. Only P1 and P2 page outside business hours. P3 alerts go to Slack and are handled the next business day.
This classification alone reduced our after-hours pages by 35%. Many alerts we were waking people up for turned out to be P3 issues that could wait until morning.
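To make the severity rules concrete, here is a minimal sketch of classification and routing. The inputs (`pct_users_down`, `degraded`) are a simplification chosen for the example, and treating "partial outage" as "some users affected, but 10% or fewer" is my reading of the definitions rather than something the definitions spell out.

```python
def classify(pct_users_down: float, degraded: bool) -> str:
    """Map observed impact to a severity level.

    pct_users_down: percentage of users for whom user-facing
                    functionality is down (0-100).
    degraded:       True if performance is degraded for users.
    """
    if pct_users_down > 10:
        return "P1"
    if pct_users_down > 0 or degraded:
        return "P2"  # partial outage or degraded performance
    return "P3"      # internal system unhealthy, minimal user impact


def route(severity: str) -> str:
    """P1 and P2 page a human at any hour; P3 goes to Slack for the
    next business day."""
    return "page" if severity in ("P1", "P2") else "slack"


for impact in [(25.0, False), (4.0, False), (0.0, True), (0.0, False)]:
    sev = classify(*impact)
    print(impact, sev, route(sev))
```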
Blameless Postmortems
Every P1 and P2 incident gets a postmortem within 48 hours. The format is straightforward: timeline, root cause, impact, and action items. But the cultural part matters more than the template. Two rules we enforce strictly:
First, no finger-pointing. We never write "Engineer X deployed a bad config." We write "A configuration change was deployed that had not been validated in staging." The focus is always on the system and process, not the individual. If a human made an error, the question is always "what guardrail was missing that would have caught this?"
Second, every action item must have an owner and a due date. "We should add better monitoring" is not an action item. "Add a latency alert at p99 > 500ms for the payments service, owned by @alice, due by Dec 15" is an action item. We track completion rates — our target is 90% of action items completed within their deadline.
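Both rules for action items are simple enough to check in code. The sketch below is one possible way to do it: the `ActionItem` fields are hypothetical, the year on the dates is invented for the example, and items that are not yet complete count against the on-time rate, which is a deliberate simplification.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str]
    due: Optional[date]
    completed: Optional[date] = None


def is_valid(item: ActionItem) -> bool:
    """An action item needs both an owner and a due date."""
    return bool(item.owner) and item.due is not None


def on_time_rate(items: list[ActionItem]) -> float:
    """Fraction of valid action items completed on or before their due date."""
    tracked = [i for i in items if is_valid(i)]
    if not tracked:
        return 0.0
    on_time = sum(
        1 for i in tracked if i.completed is not None and i.completed <= i.due
    )
    return on_time / len(tracked)


# Illustrative data; the year is an assumption.
items = [
    ActionItem("Add p99 > 500ms latency alert for payments", "alice",
               date(2024, 12, 15), completed=date(2024, 12, 10)),
    ActionItem("Validate config changes in staging", "bob",
               date(2024, 12, 1), completed=date(2024, 12, 20)),
    ActionItem("We should add better monitoring", None, None),
]

print([i.description for i in items if not is_valid(i)])
print(on_time_rate(items))
```

The vague "better monitoring" item is rejected outright, so it never enters the completion rate at all.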
Reducing Alert Fatigue
Alert fatigue is the silent killer of on-call programs. When engineers are getting 20+ alerts per shift and most of them are noise, they stop taking any alert seriously. We audit our alerts quarterly using a simple framework:
- Delete alerts that have never fired, or that fire constantly without anyone acting on them.
- Tune alerts where fewer than 70% of firings are actionable (that is, more than 30% of firings are false positives).
- Merge alerts that always fire together into a single, higher-level alert.
- Automate alerts where the response is always the same manual step (restart a pod, clear a cache).
We went from 47 active alerts to 18 after our first audit. Pages per week dropped from an average of 12 to 3. The remaining alerts all require human judgment and have clear runbooks.
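The audit framework can be approximated with a few counters per alert. This is a sketch under assumptions: the `AlertStats` fields are hypothetical, "actionable" means someone actually did something in response to the firing, and the merge rule is left out because it needs co-firing data that this simple structure does not carry.

```python
from dataclasses import dataclass


@dataclass
class AlertStats:
    name: str
    firings: int           # times the alert fired in the audit window
    actionable: int        # firings where someone took action
    same_manual_fix: bool  # response is always the same manual step


def audit(alert: AlertStats) -> str:
    """Suggest an audit outcome: delete, automate, tune, or keep."""
    if alert.firings == 0:
        return "delete"    # never fired
    if alert.actionable == 0:
        return "delete"    # fires, but nobody ever acts on it
    if alert.same_manual_fix:
        return "automate"  # e.g. restart a pod, clear a cache
    if alert.actionable / alert.firings < 0.70:
        return "tune"      # more than 30% false positives
    return "keep"


# Illustrative alerts only.
stats = [
    AlertStats("disk-legacy-host", firings=0, actionable=0, same_manual_fix=False),
    AlertStats("cache-stale", firings=10, actionable=10, same_manual_fix=True),
    AlertStats("cpu-high", firings=10, actionable=5, same_manual_fix=False),
    AlertStats("orders-5xx", firings=10, actionable=9, same_manual_fix=False),
]

for s in stats:
    print(s.name, audit(s))
```

Checking "automate" before "tune" is a judgment call: if the fix is always the same step, automating it removes the page regardless of how noisy the alert is.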
The measure of a good on-call program is not how fast you respond to incidents. It is how few incidents require a response in the first place.
On-call is a team sport. Invest in the culture, the runbooks, and the alert quality, and your engineers will stop dreading their rotation weeks.