What are runbooks and why do you need them?
A runbook is a documented, step-by-step procedure for performing a specific operational task - written clearly enough that someone unfamiliar with the system can execute it correctly under time pressure. Runbooks are the operational documentation that bridges the gap between "this person knows how to do it" and "anyone on the team can do it." They are most critical during incidents, when stress is high and the person with the relevant expertise may be unavailable.
The distinction between a runbook and general documentation is precision. A knowledge base article might explain how your deployment pipeline works conceptually. A runbook tells you exactly what to type, what to click, and what to verify at each step when you need to deploy right now. Runbooks assume the reader has context about the team's tools but no specific knowledge of this particular operation.
PagerDuty's 2024 State of Digital Operations report found that organizations with maintained runbooks resolve incidents 40% faster on average than those without. The improvement is not because runbooks contain secret knowledge - it is because they eliminate the time spent figuring out what to do next. During an incident, decision fatigue is the bottleneck. Runbooks remove decisions and replace them with steps.
What should a runbook look like?
A good runbook is optimized for execution, not comprehension. The reader is not learning - they are doing. Every element of the format should make it faster to follow under pressure.
The standard runbook structure
- Title and purpose. A clear, specific title: "Restore PostgreSQL from backup" not "Database operations." One sentence describing when to use this runbook.
- Prerequisites. What access, tools, or state the reader needs before starting. "You need production database access (granted via Terraform) and the pg_restore CLI tool installed." If they do not have a prerequisite, tell them how to get it or who to contact.
- Steps. Numbered, sequential, and explicit. Each step should be a single action with an expected outcome. "Step 3: Run `pg_restore -d flux_prod backup_2026-05-19.sql`. Expected: the restore completes in 2-5 minutes with no errors. If you see 'ERROR: relation already exists', proceed to the troubleshooting section."
- Verification. How to confirm the operation succeeded. "Verify: run `SELECT count(*) FROM cards;` - the count should match the pre-restore snapshot within 1%."
- Rollback. What to do if the operation fails or needs to be undone. Not every operation is reversible, but document what you can. "If the restore fails, the previous database is still available at the read replica. Switch the application connection string to point to the replica."
- Troubleshooting. Common failure modes and their resolutions. These are learned from experience - every time someone follows the runbook and hits an undocumented error, they add it here.
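The structure above can be sketched as a small data model, which makes it easy to check a draft for empty sections before it goes to review. This is a minimal illustration, not a prescribed schema: the class names, the `missing_sections` helper, and the sample purpose sentence are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    action: str    # a single action, ideally a literal command
    expected: str  # what success looks like for this step


@dataclass
class Runbook:
    title: str
    purpose: str
    prerequisites: list[str]
    steps: list[Step]
    verification: str
    rollback: str
    # Symptom -> resolution. Starts empty and grows as people hit new errors.
    troubleshooting: dict[str, str] = field(default_factory=dict)

    def missing_sections(self) -> list[str]:
        """Return the names of required sections that are still empty."""
        required = {
            "title": self.title,
            "purpose": self.purpose,
            "prerequisites": self.prerequisites,
            "steps": self.steps,
            "verification": self.verification,
            "rollback": self.rollback,
        }
        return [name for name, value in required.items() if not value]


restore = Runbook(
    title="Restore PostgreSQL from backup",
    purpose="Use when the production database must be rebuilt from a backup.",
    prerequisites=["Production database access", "pg_restore CLI installed"],
    steps=[
        Step(
            action="pg_restore -d flux_prod backup_2026-05-19.sql",
            expected="Restore completes in 2-5 minutes with no errors",
        )
    ],
    verification="",  # left empty on purpose to show the check
    rollback="Point the application connection string at the read replica.",
)
print(restore.missing_sections())  # reports the empty verification section
```

Troubleshooting is deliberately optional in this sketch, since that section is filled in from experience rather than written up front.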
Format rules for stress conditions
Runbooks are often read during incidents - high stress, time pressure, possibly at 3 AM. Design for those conditions:
- Commands must be copyable. Put every command in a code block. Never describe a command in prose when you can show it literally. "Run the backup restore command" is worse than showing the exact command with all flags.
- One action per step. "SSH into the server and restart the service" is two steps. Split them. The reader under pressure might execute the first action and then forget the second.
- Expected outcomes for every step. If the reader does not know what "success" looks like at each step, they cannot tell if something went wrong until the end - when it may be too late to recover.
- No ambiguity. "Wait a few minutes" is ambiguous. "Wait 3 minutes" is specific. "Check if the service is healthy" is ambiguous. "Run `curl -s localhost:3000/health` - expected response: `{"status":"ok"}`" is specific.
| Runbook element | Good example | Bad example |
|---|---|---|
| Step instruction | Run `docker restart flux-api` | Restart the API container |
| Expected outcome | Container status shows "Up" within 10 seconds | Service should come back up |
| Prerequisite | SSH access to prod-server-1 (request via #infra-access Slack) | Server access required |
| Troubleshooting | If port 3000 is already in use, run `lsof -i :3000` and kill the stale process | Sometimes the port is busy |
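Some of these rules can be checked mechanically before a human review. The sketch below is a rough heuristic under stated assumptions: the phrase list and the "contains 'and'" test are illustrative inventions, they will produce false positives, and they cannot tell that an instruction with no literal command is a bad one.

```python
import re

# Hypothetical phrase list; extend it with wording your own team tends to use.
VAGUE_PHRASES = ["a few minutes", "a while", "should come back", "sometimes", "check if"]


def lint_step(text: str) -> list[str]:
    """Return warnings for one runbook step written as plain text."""
    warnings = []
    lowered = text.lower()
    for phrase in VAGUE_PHRASES:
        if phrase in lowered:
            warnings.append(f"vague wording: '{phrase}'")
    # Crude heuristic: the word "and" often joins two separate actions.
    if re.search(r"\band\b", lowered):
        warnings.append("possible multiple actions: contains 'and'")
    return warnings


for step in [
    "SSH into the server and restart the service",
    "Wait a few minutes",
    "Run docker restart flux-api",
]:
    print(step, "->", lint_step(step))
```

Treat the warnings as prompts for the author, not as a gate. A step that passes the lint can still be unclear to a reader under pressure.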
Which runbooks should your team write first?
You cannot document every operation on day one. Prioritize based on two factors: frequency (how often the operation is performed) and criticality (what happens if it goes wrong).
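One simple way to turn the two factors into an ordering is to score each on a small scale and multiply them. The 1-5 scales, the product formula, and the sample operations below are assumptions made for illustration. The text does not prescribe a scoring method.

```python
from dataclasses import dataclass


@dataclass
class Operation:
    name: str
    frequency: int    # 1 (about yearly) to 5 (daily); the scale is an assumption
    criticality: int  # 1 (minor annoyance) to 5 (outage or data loss); also assumed


def priority(op: Operation) -> int:
    """Higher score means the runbook should be written sooner."""
    return op.frequency * op.criticality


backlog = [
    Operation("Clear CDN cache", frequency=3, criticality=1),
    Operation("Rotate signing keys", frequency=1, criticality=5),
    Operation("Production deployment", frequency=5, criticality=4),
]

for op in sorted(backlog, key=priority, reverse=True):
    print(priority(op), op.name)
```

A product favors frequent operations, so a rare but dangerous task can rank lower than intuition suggests. If that ordering looks wrong for your team, weight criticality more heavily or rank the highest-criticality items by hand.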
The essential five
- Production deployment. The complete process from code merge to running in production. Every team needs this, even if deployment is automated - the runbook covers what to do when automation fails.
- Incident response triage. What to do in the first 15 minutes when an alert fires. Who to notify, where to look, how to assess severity. Link this to your incident response template.
- Database backup and restore. How to take a backup, how to verify it, and how to restore from it. Test the restore procedure quarterly - a backup you have never restored from is not a backup.
- Secret rotation. How to rotate API keys, database credentials, and signing keys without downtime. This is the operation most likely to be needed urgently (after a credential leak) and least likely to be documented.
- Scaling under load. How to scale application instances, database connections, or cache capacity when traffic exceeds normal levels. Include both the manual process and the monitoring thresholds that should trigger it.
Expanding the library
After the essential five, add runbooks for any operation that meets one of these criteria:
- Only one person knows how to do it (bus factor of one)
- It has caused an incident in the past due to human error
- It is performed infrequently enough that the procedure is forgotten between occurrences
- It involves production data and a mistake could cause data loss
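The four criteria can be written down as a checklist function, which is handy when auditing a long list of operations. This is a sketch: the field names are hypothetical, and the 90-day cutoff for "infrequently enough to be forgotten" is an assumption, since the criteria do not give a number.

```python
from dataclasses import dataclass


@dataclass
class OperationProfile:
    name: str
    people_who_know_it: int
    caused_incident_by_human_error: bool
    days_between_runs: int
    risks_production_data_loss: bool


# Assumed cutoff for "infrequent enough to be forgotten"; tune it for your team.
FORGOTTEN_AFTER_DAYS = 90


def reasons_for_runbook(op: OperationProfile) -> list[str]:
    """Return every criterion the operation meets; one is enough to justify a runbook."""
    reasons = []
    if op.people_who_know_it <= 1:
        reasons.append("bus factor of one")
    if op.caused_incident_by_human_error:
        reasons.append("past incident from human error")
    if op.days_between_runs >= FORGOTTEN_AFTER_DAYS:
        reasons.append("performed too rarely to remember")
    if op.risks_production_data_loss:
        reasons.append("risk of data loss")
    return reasons


tls = OperationProfile(
    name="Renew TLS certificate",
    people_who_know_it=1,
    caused_incident_by_human_error=False,
    days_between_runs=365,
    risks_production_data_loss=False,
)
print(reasons_for_runbook(tls))
```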
How do you use a kanban board to track runbook creation?
Writing runbooks is a project like any other - it needs tracking, assignment, and deadlines. A kanban workflow for runbook creation ensures that identified gaps get closed rather than languishing in a backlog.
Runbook board in Flux
Create a dedicated board or use a label on your existing engineering board. The workflow columns:
- Identified - an operation needs a runbook. The card describes the operation and why it needs documenting. Sources: incident retrospectives, on-call handoffs, bus factor audits.
- Drafting - someone is writing the runbook. The person who performs the operation most frequently should write the first draft.
- Testing - a different team member follows the runbook in a staging environment. If they can complete the operation without asking the author any questions, the runbook passes. If they cannot, the gaps become edits.
- Published - the runbook is in the knowledge base and the team knows where to find it.
- Needs Update - an existing runbook has been flagged as stale, usually after an incident where the steps were no longer accurate.
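The columns above imply a small state machine. The sketch below models it in plain Python to make the intended flow explicit. It is not a Flux API, and the allowed transitions are assumptions inferred from the column descriptions, which list the columns but not the permitted moves.

```python
from enum import Enum


class Column(Enum):
    IDENTIFIED = "Identified"
    DRAFTING = "Drafting"
    TESTING = "Testing"
    PUBLISHED = "Published"
    NEEDS_UPDATE = "Needs Update"


# Assumed transitions, inferred from the column descriptions.
ALLOWED = {
    Column.IDENTIFIED: {Column.DRAFTING},
    Column.DRAFTING: {Column.TESTING},
    # A failed peer test sends the card back for edits.
    Column.TESTING: {Column.PUBLISHED, Column.DRAFTING},
    Column.PUBLISHED: {Column.NEEDS_UPDATE},
    Column.NEEDS_UPDATE: {Column.DRAFTING},
}


def can_move(src: Column, dst: Column) -> bool:
    """Return True if a card may move directly from src to dst."""
    return dst in ALLOWED[src]


print(can_move(Column.DRAFTING, Column.TESTING))      # a draft may go to testing
print(can_move(Column.IDENTIFIED, Column.PUBLISHED))  # skipping the peer test is not allowed
```

The point of the model is the missing edge: there is no path to Published that bypasses Testing.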
Maintenance cadence
Runbooks decay faster than most documentation because the systems they describe change frequently. Set a monthly review cadence for critical runbooks (deployment, incident response) and a quarterly cadence for others. Create recurring cards on your Flux board assigned to the runbook owner.
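If the recurring cards are generated by a script rather than by hand, the cadence reduces to a date calculation. The sketch below approximates "monthly" as 30 days and "quarterly" as 90 days, which is an assumption. The function names are hypothetical.

```python
from datetime import date, timedelta

# Approximations of "monthly" and "quarterly" in days; adjust to your calendar.
REVIEW_INTERVAL = {
    True: timedelta(days=30),   # critical runbooks
    False: timedelta(days=90),  # everything else
}


def next_review(last_reviewed: date, critical: bool) -> date:
    """Return the date the next review is due."""
    return last_reviewed + REVIEW_INTERVAL[critical]


def is_overdue(last_reviewed: date, critical: bool, today: date) -> bool:
    """Return True once the due date has passed."""
    return today > next_review(last_reviewed, critical)


print(next_review(date(2026, 1, 1), critical=True))
print(is_overdue(date(2026, 1, 1), critical=True, today=date(2026, 2, 15)))
```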
The most important maintenance trigger is incidents. After every incident, ask: did a runbook exist for this scenario? If yes, was it accurate and complete? If no, should one be created? Add the answer as a card on the board. Over time, your runbook library grows organically from the incidents your team actually experiences.
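The three post-incident questions map naturally onto the board columns. The mapping below is one possible reading, not a rule from the text. It assumes a card is only worth opening when the answer calls for work, and the function name is hypothetical.

```python
from typing import Optional


def follow_up_column(
    runbook_existed: bool,
    was_accurate_and_complete: bool,
    should_create: bool,
) -> Optional[str]:
    """Map the post-incident answers to a board column, or None if no work is needed."""
    if runbook_existed:
        # An existing runbook only needs a card when it let the responder down.
        return None if was_accurate_and_complete else "Needs Update"
    # No runbook existed: open a card only if the scenario is worth documenting.
    return "Identified" if should_create else None


print(follow_up_column(True, False, False))   # stale runbook
print(follow_up_column(False, False, True))   # gap worth documenting
print(follow_up_column(True, True, False))    # runbook worked as written
```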
How do you test that a runbook actually works?
An untested runbook is an assumption. You are assuming the steps are correct, the commands work, and the expected outcomes match reality. The only way to validate a runbook is to have someone who did not write it follow the steps in a realistic environment.
The peer-test protocol
When a runbook moves from Drafting to Testing on the board, a different team member attempts to follow it in staging. The rules:
- The tester cannot ask the author any questions. If a step is unclear, they flag it on the card instead of asking - the runbook must stand alone.
- The tester notes every point of confusion, every missing prerequisite, and every command that does not produce the expected output.
- The author incorporates the feedback and the tester tries again. Repeat until the runbook passes without questions.
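The pass condition in this protocol is strict enough to state as code: a different person, no questions to the author, and nothing flagged. The record below is a minimal sketch. The class, its fields, and the sample names and finding are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PeerTestRound:
    author: str
    tester: str
    asked_author: bool = False
    # Points of confusion, missing prerequisites, commands with unexpected output.
    findings: list[str] = field(default_factory=list)

    def passed(self) -> bool:
        """A round passes only if an independent tester finished with nothing to report."""
        independent = self.tester != self.author
        return independent and not self.asked_author and not self.findings


rounds = [
    PeerTestRound("dana", "lee", findings=["Step 4: VPN access not listed as a prerequisite"]),
    PeerTestRound("dana", "lee"),  # second attempt after the author's edits
]
print([r.passed() for r in rounds])
```

Keeping every round, not just the last one, gives the author a history of what was unclear and how many iterations the runbook needed.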
Periodic fire drills
For critical runbooks (incident response, database restore), run fire drills quarterly. A team member follows the runbook in a staging or test environment under simulated pressure. This validates both the runbook and the team's operational readiness. Document the fire drill results - time to complete, steps that caused confusion, environment differences from production - and update the runbook accordingly.
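Recording drill results in a consistent shape makes it easier to turn them into runbook edits. The sketch below captures the three things named above: time to complete, confusing steps, and environment differences. The class, its field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DrillResult:
    runbook: str
    minutes_to_complete: int
    confusing_steps: list[str] = field(default_factory=list)
    environment_differences: list[str] = field(default_factory=list)

    def follow_ups(self) -> list[str]:
        """Turn the recorded observations into edits for the runbook owner."""
        items = [f"Clarify step: {s}" for s in self.confusing_steps]
        items += [
            f"Document staging/production difference: {d}"
            for d in self.environment_differences
        ]
        return items


drill = DrillResult(
    runbook="Restore PostgreSQL from backup",
    minutes_to_complete=18,
    confusing_steps=["Step 3"],
    environment_differences=["staging has no read replica"],
)
for item in drill.follow_ups():
    print(item)
```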
For a broader perspective on operational documentation and how it fits into your team's knowledge management practice, see the knowledge management guide. For a DevOps-oriented board structure that includes runbook tracking alongside deployment and infrastructure work, see the DevOps pipeline template.