Incident Response Playbooks: A Practical Guide to Handling Website Outages
When a production outage hits at 3 AM, the difference between a 15-minute recovery and a 4-hour scramble comes down to preparation. This guide walks through building incident response playbooks that actually work under pressure — not theoretical frameworks, but battle-tested procedures drawn from how high-reliability teams operate.
Why Most Incident Response Plans Fail
Most organizations have incident response plans. The problem is that the majority of these plans sit in a wiki that nobody has read since it was written. When an actual outage occurs, the documented process breaks down for predictable reasons: the plan assumes specific people are available, it references tools that have since changed, or it describes steps too vaguely to execute under stress.
The fundamental issue is that incident response plans are typically written during calm periods by people who are thinking logically. They're executed during chaotic periods by people who are stressed, sleep-deprived, and dealing with incomplete information. This gap between authoring conditions and execution conditions is where most plans fail.
Effective playbooks account for this reality. They use checklists rather than paragraphs. They specify exact commands rather than general directions. They define roles clearly enough that anyone on the team can step into any role, not just the person who usually fills it. And critically, they are tested regularly through tabletop exercises and game days, not just reviewed annually.
The Cognitive Load Problem
Research in human factors engineering shows that under stress, working memory capacity drops significantly. A playbook that requires reading and interpreting three-page procedures is likely to fail. Effective incident response documents are designed for scanning, not reading: short sentences, numbered steps, and clear decision points.
Severity Classification: Getting the First Decision Right
The first minutes of an incident determine its trajectory. Getting the severity classification right — or at least approximately right — affects who gets paged, what communication channels activate, and how quickly the organization mobilizes. Classification systems that rely on subjective judgment ("Is this a major incident?") create hesitation and inconsistency.
SEV-1: Complete Service Failure
- Primary service entirely unavailable to all users
- Data loss or corruption actively occurring
- Security breach with active exploitation
- Response: All-hands, executive notification within 15 minutes, status page updated within 5 minutes
SEV-2: Major Degradation
- Core functionality degraded for a significant user segment
- Performance degraded beyond acceptable thresholds
- Secondary systems failed with potential cascade risk
- Response: On-call team plus relevant domain experts, status page updated within 15 minutes
SEV-3: Minor Impact
- Non-critical feature unavailable
- Workaround exists for affected users
- Performance slightly degraded but within tolerance
- Response: On-call engineer investigates during business hours, no immediate escalation required
SEV-4: Informational
- Anomaly detected but no user impact confirmed
- Monitoring alert triggered but within expected variance
- Potential issue identified during routine checks
- Response: Logged for review, investigated during next business day
The key insight about severity classification is that it should be based on observable impact, not on the root cause. A database failover that completes seamlessly is SEV-4 even though the underlying event is significant. A CSS bug that makes the checkout button invisible on mobile is at least SEV-2, and SEV-1 if mobile checkout is the primary path for users, even though the technical issue is trivial. Impact to users determines severity, not the complexity of what broke.
Teams should also establish a clear rule: when in doubt, escalate. It's far less costly to downgrade a SEV-1 to a SEV-3 after investigation than to discover that a SEV-3 should have been a SEV-1 an hour later. Build a culture where over-escalation is praised, not punished.
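The classification rules above can be sketched in code so that the first decision does not depend on subjective judgment. This is a minimal illustration, not a standard: the `Impact` field names are hypothetical, and treating "when in doubt, escalate" as "raise the severity by one level" is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class Impact:
    """Observable impact signals. All field names are illustrative."""
    primary_service_down: bool = False
    data_loss_active: bool = False
    active_security_exploit: bool = False
    core_function_degraded: bool = False
    cascade_risk: bool = False
    noncritical_feature_down: bool = False
    uncertain: bool = False  # the responder is unsure which level applies


def classify(impact: Impact) -> str:
    """Map observable impact to a severity label, escalating one level when in doubt."""
    if (impact.primary_service_down or impact.data_loss_active
            or impact.active_security_exploit):
        level = 1
    elif impact.core_function_degraded or impact.cascade_risk:
        level = 2
    elif impact.noncritical_feature_down:
        level = 3
    else:
        level = 4
    # Assumption: "when in doubt, escalate" is modeled as one level more severe.
    if impact.uncertain and level > 1:
        level -= 1
    return f"SEV-{level}"
```

Note that nothing in `Impact` describes the root cause. A seamless database failover sets no flags and comes out as SEV-4, which matches the rule that impact, not cause, drives severity.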
The Incident Commander Role
Every significant incident needs exactly one person coordinating the response. This role — commonly called the Incident Commander (IC) — is borrowed from emergency services and adapted for technical operations. The IC doesn't need to be the most senior engineer or the deepest domain expert. They need to be calm under pressure, organized, and willing to make decisions with incomplete information.
The IC's responsibilities are coordination, not technical problem-solving. They maintain the timeline of events, assign investigation tasks, manage communication to stakeholders, and decide when to escalate or de-escalate. When the IC starts debugging code, the coordination function stops, and the incident response degrades into unstructured troubleshooting.
Incident Response Roles
Incident Commander
Coordinates the response. Makes escalation decisions. Manages the communication cadence. Does NOT debug technical issues directly.
Technical Lead
Drives the technical investigation. Proposes and evaluates mitigation options. Delegates specific diagnostic tasks to other engineers.
Communications Lead
Drafts and publishes status page updates. Manages internal stakeholder notifications. Handles customer-facing communications if needed.
Scribe
Records the timeline of events, decisions made, and actions taken. This record becomes the foundation for the post-incident review.
A common mistake is assigning these roles based on seniority or job title. Instead, they should rotate across the team. Every engineer who participates in on-call should be trained and practiced in each role. This prevents single points of failure in the incident response process itself — which would be ironic for a team focused on eliminating single points of failure in their systems.
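One way to make the rotation mechanical is a simple round-robin keyed on an incident counter. The sketch below assumes a team list and an incident number are available; both, along with the function name `assign_roles`, are hypothetical.

```python
ROLES = ["Incident Commander", "Technical Lead", "Communications Lead", "Scribe"]


def assign_roles(engineers: list[str], incident_number: int) -> dict[str, str]:
    """Rotate roles so each engineer cycles through every role over successive incidents."""
    if len(engineers) < len(ROLES):
        raise ValueError("need at least one engineer per role")
    offset = incident_number % len(engineers)
    return {
        role: engineers[(offset + i) % len(engineers)]
        for i, role in enumerate(ROLES)
    }
```

For example, `assign_roles(["ana", "ben", "cy", "dee", "eli"], 1)` makes "ben" the Incident Commander and "eli" the Scribe. A real rotation would also need to account for who is actually on call and available, which this sketch ignores.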
Building Runbooks for Common Failure Modes
While every incident is unique in its details, most fall into a relatively small number of categories. Database connection exhaustion, certificate expiry, disk space exhaustion, memory leaks, DNS propagation failures, and deployment rollback are scenarios that recur across organizations. For each of these, a specific runbook should exist with exact diagnostic commands and remediation steps.
Example: Database Connection Pool Exhaustion
This is one of the most common causes of application-level outages. The application appears to hang or return timeout errors, but the database server itself shows normal CPU and memory usage. The issue is that the connection pool is saturated — all connections are checked out and none are being returned.
Diagnostic Steps
- Check application connection pool metrics (active connections, waiting threads, pool size)
- Query database for active connections grouped by application host and state (active, idle, idle in transaction)
- Look for long-running queries or transactions that are holding connections open
- Check if connection pool max size was recently changed or if traffic increased significantly
- Review application logs for connection timeout errors and their timestamps
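The grouping step in the list above can be sketched as a small helper. It assumes the connection list has already been fetched as dictionaries with hypothetical keys `host` and `state`. If the database happens to be PostgreSQL, rows like these could come from the `pg_stat_activity` view, but that mapping is an assumption, since no specific database is named here.

```python
from collections import Counter


def count_by_host_and_state(rows):
    """Count connections per (host, state) pair.

    rows: iterable of dicts with hypothetical keys "host" and "state".
    """
    return Counter((row["host"], row["state"]) for row in rows)


def busiest(counts, top=3):
    """Return the (host, state) pairs holding the most connections."""
    return counts.most_common(top)
```

A host with an unusually high count in the "idle in transaction" state is a likely place to start looking for code paths that open a transaction and never close it.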
Remediation Options (in order of preference)
- Terminate idle-in-transaction connections that have been open longer than the expected transaction timeout
- Kill specific long-running queries if they can be identified as runaway
- Temporarily increase the connection pool maximum (with awareness of database connection limits)
- Rolling restart of application instances to force connection pool reset
- If caused by a deployment, initiate rollback procedure
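The first remediation option can be sketched as a filter that picks out which connections are candidates for termination. The row keys (`pid`, `state`, `state_since`) and the five-minute default are placeholders for illustration; the real threshold should come from the application's expected transaction timeout.

```python
from datetime import datetime, timedelta


def connections_to_terminate(rows, now: datetime,
                             max_idle_in_txn: timedelta = timedelta(minutes=5)):
    """Return pids of idle-in-transaction connections held longer than the timeout.

    rows: iterable of dicts with hypothetical keys "pid", "state", "state_since".
    """
    return sorted(
        row["pid"]
        for row in rows
        if row["state"] == "idle in transaction"
        and now - row["state_since"] > max_idle_in_txn
    )
```

The function only selects candidates and does not terminate anything itself. Keeping selection separate from the destructive step lets the responder review the list before acting, which matters when decisions are being made under stress.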
Example: Memory Leak Leading to OOM Kills
Memory leaks often present as gradually degrading performance followed by a sudden crash when the operating system's out-of-memory killer terminates the process. The insidious aspect is that the application may restart automatically and appear healthy for a period before the cycle repeats.
The diagnostic approach differs from connection pool issues because the symptom often isn't "the application is broken" but rather "the application keeps restarting." Check system logs for OOM kill events. Graph memory usage over time — a characteristic sawtooth pattern (gradual increase followed by sharp drop) confirms a leak with automatic restarts.
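The sawtooth check can be approximated in code when a graph is not at hand. The sketch below is a rough heuristic under stated assumptions: a "sharp drop" is taken to mean a reading below half of the previous one, and two full cycles are required before calling it a pattern. Both thresholds are arbitrary defaults, not established values.

```python
def looks_like_sawtooth(samples, drop_fraction=0.5, min_cycles=2):
    """Heuristically detect repeated gradual-rise / sharp-drop cycles.

    samples: memory usage readings in time order, in any consistent unit.
    A sharp drop is a reading below drop_fraction of the previous reading.
    """
    cycles = 0
    segment_start = 0
    for i in range(1, len(samples)):
        if samples[i] < samples[i - 1] * drop_fraction:
            # Count the cycle only if memory grew during the segment before the drop.
            if samples[i - 1] > samples[segment_start]:
                cycles += 1
            segment_start = i
    return cycles >= min_cycles
```

A positive result is a prompt to check system logs for OOM kill events, not proof of a leak on its own. Scheduled restarts or deploys can produce a similar shape.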
Short-term mitigation for memory leaks typically involves more frequent restarts (reducing the time between restarts so memory doesn't reach dangerous levels) while the engineering team identifies and fixes the leak. This is an explicitly temporary measure — scheduled restarts mask the problem and should never become permanent.
Communication During Incidents
Poor communication during an outage can cause more organizational damage than the outage itself. Users who see a broken service with no acknowledgment assume the team is either unaware or incompetent. An honest status update that says "We are aware of the issue and investigating" immediately shifts the perception from "nobody is home" to "they're working on it."
Status Page Update Cadence
For SEV-1 and SEV-2 incidents, status page updates should follow a strict cadence: an initial acknowledgment within 5 minutes of detection for SEV-1 and within 15 minutes for SEV-2, then updates every 30 minutes until resolution. Even if there is no new information, post an update saying "Investigation is ongoing. No new information at this time." Silence breeds speculation and erodes trust.
The content of status updates matters as much as their frequency. Avoid technical jargon that means nothing to users. "We are experiencing elevated error rates on our API endpoints due to database replication lag" is meaningful to engineers but not to customers. "Some users may experience errors when loading their dashboard. Our team is actively working to resolve this" covers the same incident in a way that tells users what they need to know: what's broken and that someone is fixing it.
Communication Template: Initial Acknowledgment
Title: [Service Name] - Investigating [Impact Description]
Body: We are currently investigating reports of [specific user-facing impact]. Our team is actively working to identify and resolve the issue. We will provide updates every [30 minutes / 1 hour] until the issue is resolved.
Status: Investigating
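Storing the template in code means nobody has to compose wording from scratch at 3 AM. The sketch below fills in the bracketed fields of the template above. The function name and parameter names are hypothetical, and the returned dictionary is only an illustration, not the payload format of any particular status page product.

```python
TITLE_TEMPLATE = "{service} - Investigating {impact}"
BODY_TEMPLATE = (
    "We are currently investigating reports of {user_impact}. "
    "Our team is actively working to identify and resolve the issue. "
    "We will provide updates every {interval} until the issue is resolved."
)


def initial_acknowledgment(service, impact, user_impact, interval="30 minutes"):
    """Fill in the initial acknowledgment template and return it as a dict."""
    return {
        "title": TITLE_TEMPLATE.format(service=service, impact=impact),
        "body": BODY_TEMPLATE.format(user_impact=user_impact, interval=interval),
        "status": "Investigating",
    }
```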
Post-Incident Reviews: Learning Without Blame
The post-incident review (often called a postmortem or retrospective) is where incidents become investments in reliability rather than just costs. The goal is not to find who made a mistake — it's to understand why the system allowed that mistake to cause an outage and what structural changes would prevent similar incidents.
Blameless post-incident reviews are not about avoiding accountability. They're about recognizing that in complex systems, outages are caused by systemic factors, not individual errors. If one person's action can bring down a production system, the problem is that one person's action can bring down a production system — not that the person took that action.
Effective Post-Incident Review Structure
Timeline Reconstruction
Build a precise timeline from the scribe's notes and system logs. Focus on what happened and when, not why or who. Include detection time, response time, mitigation time, and resolution time.
Contributing Factors Analysis
Identify all factors that contributed to the incident occurring and to the duration of the outage. Technical causes, process gaps, monitoring blind spots, and communication breakdowns are all valid contributing factors.
Action Items with Owners and Deadlines
Every post-incident review should produce concrete, assigned, time-bound action items. "Improve monitoring" is not an action item. "Add alerting for connection pool utilization exceeding 80%, assigned to [person], due by [date]" is.
The most critical metric for post-incident review effectiveness is action item completion rate. If reviews consistently produce action items that never get done, the review process becomes performative. Track completion rates and escalate when items are consistently deferred. The action items from post-incident reviews represent the organization's commitment to not repeating the same failures — ignoring them is accepting the same risk repeatedly.
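Completion rate is simple to compute once action items are recorded with an owner and a due date. The sketch below assumes a minimal `ActionItem` record. The field names are hypothetical, and in practice the data would more likely come from an issue tracker.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    description: str
    owner: str
    due: date
    completed_on: Optional[date] = None


def completion_rate(items) -> float:
    """Fraction of action items that have been completed (0.0 for an empty list)."""
    if not items:
        return 0.0
    done = sum(1 for item in items if item.completed_on is not None)
    return done / len(items)


def overdue(items, today: date):
    """Open items whose due date has passed; candidates for escalation."""
    return [item for item in items if item.completed_on is None and item.due < today]
```

Reviewing the `overdue` list at a regular cadence is one way to notice items that keep getting deferred.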
Practicing Incident Response: Game Days
A playbook that hasn't been practiced is a hypothesis, not a plan. Game days — scheduled exercises where the team responds to simulated incidents — are the most effective way to validate that playbooks work and that team members are comfortable in their roles.
Start with tabletop exercises where the team walks through scenarios verbally. "It's Tuesday at 2 PM. The monitoring system alerts that API error rates have jumped from 0.1% to 15%. What do you do?" Walk through the playbook step by step. Identify where it's unclear, where it references tools that have changed, or where assumptions don't hold.
As the team matures, progress to live exercises. Intentionally introduce failures in a staging environment (or production, if you've built sufficient safety controls) and have the team respond using the playbook. This reveals gaps that tabletop exercises miss — the monitoring alert that fires but goes to the wrong channel, the runbook command that requires permissions the on-call engineer doesn't have, or the escalation path that routes to someone who left the company three months ago.
The goal of game days is not to test whether the team can handle the scenario — it's to find and fix the problems in the incident response process before a real incident exposes them. Every game day should produce a list of improvements to playbooks, tooling, and processes. If a game day runs perfectly, either the scenario was too simple or the team isn't looking hard enough for gaps.
Measuring Incident Response Effectiveness
Improving incident response requires measuring it. The core metrics are Mean Time to Detect (MTTD), Mean Time to Respond, Mean Time to Mitigate (MTTM), and Mean Time to Resolve. Both "respond" and "resolve" are commonly abbreviated MTTR, which causes confusion, so state explicitly which one you mean.
Mean Time to Detect (MTTD)
The time between when an incident starts and when it's detected. Improving MTTD is primarily a monitoring and alerting problem. If your MTTD is consistently longer than a few minutes, your monitoring has blind spots.
Mean Time to Mitigate (MTTM)
The time between detection and when user impact is reduced or eliminated, even if the underlying cause isn't fixed. This is the metric most directly tied to user experience and business impact.
Track these metrics over time and look for trends. A rising MTTD suggests your monitoring isn't keeping pace with system changes. A rising MTTM suggests your runbooks need updating or your team needs more practice. Falling metrics suggest that your investment in the incident response process is paying off.
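Both metrics reduce to averaging the gap between two timestamps per incident. The sketch below assumes each incident record carries `started_at`, `detected_at`, and `mitigated_at` timestamps. Those key names are hypothetical and would map onto whatever the timeline from the post-incident review records.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(incidents, start_key: str, end_key: str) -> float:
    """Average gap in minutes between two timestamps across incidents."""
    return mean(
        (incident[end_key] - incident[start_key]).total_seconds() / 60
        for incident in incidents
    )


def mttd(incidents) -> float:
    """Mean Time to Detect: incident start to detection."""
    return mean_minutes(incidents, "started_at", "detected_at")


def mttm(incidents) -> float:
    """Mean Time to Mitigate: detection to reduced or eliminated user impact."""
    return mean_minutes(incidents, "detected_at", "mitigated_at")
```

A mean can be dominated by one very long incident, so it may be worth looking at the individual values alongside the average when reading trends.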
Key Takeaways
Building effective incident response is not a one-time project — it's an ongoing practice. Start with a simple severity classification system and basic roles. Write runbooks for your three most common failure modes. Practice with tabletop exercises quarterly. Conduct blameless post-incident reviews for every significant incident, and track action item completion religiously.
The organizations that handle outages well aren't the ones with the most sophisticated tooling — they're the ones that have practiced the most. Incident response is a skill, and like all skills, it improves with deliberate practice and honest feedback.