Postmortems
Capture learnings from incidents and prevent recurrence.
Overview
A postmortem is a written record of an incident that answers: What happened? Why did it happen? What are we doing to prevent it from happening again?
Runframe makes postmortems easy by pre-filling incident data and tracking follow-up items.
Why write postmortems?
| Benefit | Explanation |
|---|---|
| Institutional memory | New team members learn from past incidents |
| Systemic improvements | Identify and fix underlying issues |
| Blameless culture | Focus on systems, not individuals |
| Customer communication | Explain what happened and what you’re doing |
| Compliance | Many organizations require incident documentation |
When to write postmortems
| Severity | Postmortem recommendation |
|---|---|
| P0 | Always required |
| P1 | Always required |
| P2 | Recommended |
| P3 | Optional |
| P4 | Skip unless it reveals a systemic issue |
Creating a postmortem
Automatic creation from incidents
When you resolve an incident, Runframe prompts you to start a postmortem:
/inc resolve
You’ll be asked to:
- Provide a resolution summary
- Decide whether to start a postmortem (Yes, Later, or Skip)
If you choose Yes, Runframe creates a PM-### document with:
- Incident title and number
- Severity and customer impact flag
- Timeline of status changes and key events
- Assigned responders
- Resolution summary
- Pre-populated sections to complete
Manual creation
Create a postmortem from the dashboard:
- Navigate to Incidents in the sidebar
- Click on the resolved incident
- Click Start Postmortem in the top right
- Runframe creates the PM-### document
Creating standalone postmortems
For incidents that occurred outside Runframe:
- Navigate to Postmortems in the sidebar
- Click New Postmortem
- Fill in the incident details manually
Postmortem template
Every postmortem follows a consistent structure for readability and searchability.
Header
| Field | Description |
|---|---|
| PM-### | Unique postmortem identifier |
| Incident INC-### | Links to the original incident |
| Title | Brief summary of the incident |
| Severity | P0 through P4 |
| Date | When the incident occurred |
| Duration | Time from detection to resolution |
| Customer impact | Whether customers were affected |
Sections
Summary
A 2 to 3 sentence overview. Write this last—it should synthesize the full postmortem.
Impact
What happened and who was affected:
- Which services were down or degraded?
- How many users were affected?
- What functionality was broken?
- Was data lost or corrupted?
Timeline
Chronological sequence of events. Use specific timestamps:
| Time | Event |
|---|---|
| 2:15 PM PT | Incident detected via Datadog alert |
| 2:17 PM PT | On-call engineer paged |
| 2:22 PM PT | Incident acknowledged, investigating |
| 2:35 PM PT | Root cause identified: database connection pool exhaustion |
| 2:45 PM PT | Hotfix deployed to production |
| 2:50 PM PT | Error rates returning to normal |
| 3:15 PM PT | Incident resolved |
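The Duration field in the header is derived from timestamps like these. A minimal sketch of the calculation (the `incident_duration` helper is illustrative, not a Runframe API):

```python
from datetime import datetime

def incident_duration(timeline: list[tuple[str, str]]) -> int:
    """Return minutes between the first (detection) and last (resolution)
    timeline entries. Timestamps use the '2:15 PM' style from the table."""
    fmt = "%I:%M %p"
    start = datetime.strptime(timeline[0][0], fmt)
    end = datetime.strptime(timeline[-1][0], fmt)
    return int((end - start).total_seconds() // 60)

timeline = [
    ("2:15 PM", "Incident detected via Datadog alert"),
    ("3:15 PM", "Incident resolved"),
]
print(incident_duration(timeline))  # 60
```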
Root cause
The technical explanation of what went wrong. Be specific:
- What code or configuration change caused this?
- What was the cascade of failures?
- What monitoring was missing?
- What made this hard to detect or fix?
Resolution and recovery
How you fixed it:
- What was the immediate fix?
- What was the permanent fix?
- How long did each step take?
- What was the rollback plan if the fix didn’t work?
Follow-up actions
Specific, actionable items to prevent recurrence. Each action should have:
| Action | Owner | Due date | Status |
|---|---|---|---|
| Add monitoring for connection pool usage | @alice | 2026-01-15 | Open |
| Increase connection pool size from 10 to 50 | @bob | 2026-01-10 | Complete |
| Run load tests before database migrations | @carol | 2026-01-20 | Open |
Lessons learned
What went well and what didn’t:
What went well:
- On-call responded within SLA
- Hotfix was deployed quickly
- Communication to stakeholders was clear
What could be improved:
- Monitoring should have detected connection pool growth earlier
- Runbook for database issues didn’t cover this scenario
- Escalation to database specialist was delayed
Be blameless
Postmortems focus on systems and processes, not individuals. Never blame specific people. Instead of “@alice forgot to update the config,” write “The deployment process didn’t validate config changes.”
Writing a great postmortem
Do’s
- Be specific – Exact times, specific services, precise error messages
- Be concise – Most postmortems should be under 1,000 words
- Include data – Metrics, screenshots, graphs from monitoring tools
- Assign actions – Every follow-up should have an owner and due date
- Share learnings – Distribute to the wider team, not just participants
Don’ts
- Don’t assign blame – Systems fail, not people
- Don’t be vague – “We’ll be more careful next time” is not a follow-up action
- Don’t bury the lede – Put the most important information in the summary
- Don’t forget the customer – Even internal incidents affect someone
Postmortem review process
After drafting:
- Technical review – Have a senior engineer verify root cause analysis
- Comms review – Ensure customer impact is accurate if sharing externally
- Action item review – Verify all actions are specific and assigned
Managing follow-up actions
Runframe tracks follow-up actions from postmortems to ensure completion.
Creating actions
From the postmortem editor, add follow-up actions:
- Description – What needs to be done
- Owner – Who is responsible (assign to a user)
- Due date – When it should be completed
- Priority – P0 through P4
Tracking actions
View all open actions from:
- Postmortems → Action Items in the dashboard
- Individual postmortem pages
- Slack notifications for overdue items
Closing actions
When an action is complete:
- Navigate to the postmortem
- Mark the action as Complete
- Add a brief note explaining what was done
- Runframe records the completion timestamp
Overdue actions
Runframe notifies:
- The action owner when due date approaches
- The postmortem author when actions are overdue
- Team managers for critical overdue actions
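The notification rules above can be sketched as a simple check. This is an illustrative model only; the field names and the two-day warning window are assumptions, not Runframe's actual schema:

```python
from datetime import date, timedelta

def overdue_recipients(action: dict, today: date) -> list[str]:
    """Who to notify about a follow-up action, per the rules above:
    owner when the due date approaches, postmortem author when overdue,
    and the team manager for critical (P0/P1) overdue actions."""
    recipients = []
    if today >= action["due"] - timedelta(days=2):  # due date approaching
        recipients.append(action["owner"])
    if today > action["due"]:                       # overdue
        recipients.append(action["pm_author"])
        if action["priority"] in ("P0", "P1"):      # critical
            recipients.append(action["manager"])
    return recipients

action = {"due": date(2026, 1, 10), "owner": "@alice",
          "pm_author": "@bob", "priority": "P0", "manager": "@carol"}
print(overdue_recipients(action, date(2026, 1, 12)))
# ['@alice', '@bob', '@carol']
```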
Action items create a feedback loop
Completing postmortem actions reduces future incidents. Track action completion rate as a metric of incident response effectiveness.
Postmortem culture
A healthy postmortem culture accelerates learning and builds trust.
Blameless postmortems
Blameless doesn’t mean no accountability—it means:
- Accountability to systems – Did our processes, tooling, or training fail?
- Psychological safety – People feel safe sharing mistakes without fear of retribution
- Focus on improvement – Energy goes into preventing recurrence, not assigning fault
When something goes wrong, ask:
- “What system failed?” not “Who messed up?”
- “What process allowed this?” not “Why didn’t you catch this?”
- “How do we prevent this?” not “Who’s responsible?”
Making time for postmortems
Postmortems take time away from feature work, but pay dividends in reduced incidents:
- Schedule postmortem time – Block 1 to 2 hours within 48 hours of resolution
- Write while fresh – Details fade quickly; write soon after incidents
- Collaborative drafting – Two heads are better than one; write together
Sharing postmortems
Decide how broadly to share based on severity and customer impact:
| Severity | Internal sharing | External sharing |
|---|---|---|
| P0 | All-hands, executive summary | Public blog post, customer notification |
| P1 | Engineering-wide, relevant teams | Customer notification if impact warrants |
| P2 | Team-specific, postmortem channel | Typically internal only |
| P3-P4 | Team-specific | Internal only |
Postmortem analytics
Track postmortem metrics to improve your incident response:
| Metric | Definition | Goal |
|---|---|---|
| Postmortem completion rate | Share of P0-P1 incidents with a completed postmortem | 100% |
| Time to postmortem | Days from incident resolution to postmortem completion | Under 7 days |
| Action completion rate | Share of follow-up actions completed within 30 days | Above 80% |
| Recurrence rate | Share of incidents caused by the same root cause within 90 days | Under 5% |
View metrics in the Analytics section of the dashboard.
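As an illustration of how one of these metrics is computed, here is a sketch of the action completion rate. The record shape is an assumption for the example; Runframe computes this for you in the Analytics section:

```python
from datetime import date

# Hypothetical action records: when each was created and completed.
actions = [
    {"created": date(2026, 1, 1), "completed": date(2026, 1, 14)},
    {"created": date(2026, 1, 1), "completed": date(2026, 1, 30)},
    {"created": date(2026, 1, 1), "completed": None},  # still open
]

def completion_rate(actions: list[dict], window_days: int = 30) -> float:
    """Share of follow-up actions completed within `window_days` of creation."""
    done = sum(
        1 for a in actions
        if a["completed"] is not None
        and (a["completed"] - a["created"]).days <= window_days
    )
    return done / len(actions)

print(f"{completion_rate(actions):.0%}")  # 67%, below the 80% goal
```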
Examples
Example postmortem: Database connection pool exhaustion
# PM-001: Database connection pool exhaustion
**Incident:** INC-042
**Severity:** P1
**Date:** January 15, 2026
**Duration:** 58 minutes
**Customer impact:** Yes – checkout flows failing
## Summary
A database connection pool exhaustion caused 500 errors on checkout flows for 58 minutes. Root cause was a recent deployment that increased concurrent query capacity without adjusting connection pool sizing. Fixed by raising the database connection limit, right-sizing per-worker connection pools, and adding monitoring for pool usage.
## Impact
- Checkout flows failed with 500 errors
- Approximately 1,200 users affected
- Estimated revenue impact: $15,000
- No data loss or corruption
## Timeline
| Time | Event |
|------|-------|
| 2:15 PM PT | Datadog alert: 500 error rate spike on checkout endpoint |
| 2:17 PM PT | On-call engineer paged via Slack DM |
| 2:22 PM PT | Incident acknowledged, investigating logs |
| 2:35 PM PT | Root cause identified: database connection pool exhaustion |
| 2:40 PM PT | Hotfix deployed: database connection limit raised to 500, per-worker pool size reduced from 10 to 5 |
| 2:45 PM PT | Error rates returning to normal |
| 3:13 PM PT | Incident resolved |
## Root cause
The deployment of v2.3.0 increased the number of concurrent worker processes from 5 to 20 to handle increased traffic. Each worker process maintains its own database connection pool of 10 connections, resulting in 200 total connections. The database connection limit was configured at 150 connections, causing 50 connections to be rejected.
The deployment process did not validate that connection pool sizing matched the new worker process count. Additionally, no monitoring existed for connection pool usage.
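The capacity check the deployment process was missing amounts to simple arithmetic. A sketch using the numbers from this incident (the function name is illustrative):

```python
def total_connections(workers: int, pool_size_per_worker: int) -> int:
    """Total database connections: each worker maintains its own pool."""
    return workers * pool_size_per_worker

db_connection_limit = 150

# Before v2.3.0: 5 workers x 10 connections = 50, well under the limit.
print(total_connections(5, 10))   # 50

# After v2.3.0: 20 workers x 10 connections = 200, exceeding the
# 150-connection limit, so 50 connection attempts were rejected.
print(total_connections(20, 10))  # 200
```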
## Resolution and recovery
**Immediate fix:** Increased database connection limit from 150 to 500 and reduced worker process connection pool size from 10 to 5, allowing 100 total connections with headroom.
**Permanent fix:**
- Connection pool sizing now calculated automatically based on worker count
- Added monitoring alert for connection pool usage above 70%
- Added pre-deployment checklist item to validate connection limits
**Rollback plan:** If fix didn't work, would have reverted to v2.2.0 with 5 worker processes.
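The 70% monitoring alert from the permanent fix can be sketched as follows. The function name and alerting wiring are illustrative, not the actual Datadog configuration:

```python
def pool_usage_alert(active_connections: int, connection_limit: int,
                     threshold: float = 0.70) -> bool:
    """Fire when connection pool usage exceeds the alert threshold."""
    return active_connections / connection_limit > threshold

# Post-fix sizing: 20 workers x 5 connections = 100, limit raised to 500.
print(pool_usage_alert(100, 500))  # False: 20% usage, comfortable headroom
print(pool_usage_alert(400, 500))  # True: 80% usage would trigger the alert
```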
## Follow-up actions
| Action | Owner | Due date | Status |
|--------|-------|----------|--------|
| Add Datadog monitoring for connection pool usage | @alice | 2026-01-17 | Complete |
| Update deployment runbook with connection pool sizing | @bob | 2026-01-18 | Complete |
| Add connection pool check to pre-deployment checklist | @carol | 2026-01-20 | Open |
| Load test database with 2x traffic before next worker scale | @dave | 2026-01-25 | Open |
## Lessons learned
**What went well:**
- On-call acknowledged within 5 minutes
- Root cause identified quickly due to clear error logs
- Hotfix deployed and verified within 30 minutes
**What could be improved:**
- Monitoring should have detected connection pool growth before exhaustion
- Deployment process should validate infrastructure changes
- Runbook didn't cover database connection issues
Need more?
- Incidents – Incident lifecycle and severity
- Slash Commands – Complete /inc command reference
- On-Call – Scheduling and rotations
- Web Dashboard – Full postmortem management UI