Postmortems
Capture learnings from incidents and prevent recurrence.
Overview
A postmortem is a written record of an incident that answers: What happened? Why did it happen? What are we doing to prevent it from happening again?
Runframe makes postmortems easy by pre-filling incident data and tracking follow-up items.
Why write postmortems?
| Benefit | Explanation |
|---|---|
| Institutional memory | New team members learn from past incidents |
| Systemic improvements | Identify and fix underlying issues |
| Blameless culture | Focus on systems, not individuals |
| Customer communication | Explain what happened and what you’re doing |
| Compliance | Many organizations require incident documentation |
When to write postmortems
| Severity | Postmortem recommendation |
|---|---|
| P0 | Always required |
| P1 | Always required |
| P2 | Recommended |
| P3 | Optional |
| P4 | Skip unless it reveals a systemic issue |
Creating a postmortem
Automatic creation from incidents
When you resolve an incident, Runframe prompts you to start a postmortem:
/inc resolve
You’ll be asked to:
- Provide a resolution summary
- Decide whether to start a postmortem (Yes, Later, or Skip)
If you choose Yes, Runframe creates a PM-### document with:
- Incident title and number
- Severity and customer impact flag
- Timeline of status changes and key events
- Assigned responders
- Resolution summary
- Pre-populated sections to complete
Manual creation
Create a postmortem from the dashboard:
- Navigate to Incidents in the sidebar
- Click on the resolved incident
- Click Start Postmortem in the top right
- Runframe creates the PM-### document
Creating standalone postmortems
For incidents that occurred outside Runframe:
- Navigate to Postmortems in the sidebar
- Click New Postmortem
- Fill in the incident details manually
Postmortem template
Every postmortem follows a consistent structure for readability and searchability.
Header
| Field | Description |
|---|---|
| PM-### | Unique postmortem identifier |
| Incident INC-### | Links to the original incident |
| Title | Brief summary of the incident |
| Severity | P0 through P4 |
| Date | When the incident occurred |
| Duration | Time from detection to resolution |
| Customer impact | Whether customers were affected |
Sections
Summary
A 2 to 3 sentence overview. Write this last—it should synthesize the full postmortem.
Impact
What happened and who was affected:
- Which services were down or degraded?
- How many users were affected?
- What functionality was broken?
- Was data lost or corrupted?
Timeline
Chronological sequence of events. Use specific timestamps:
| Time | Event |
|---|---|
| 2:15 PM PT | Incident detected via Datadog alert |
| 2:17 PM PT | On-call engineer paged |
| 2:22 PM PT | Incident acknowledged, investigating |
| 2:35 PM PT | Root cause identified: database connection pool exhaustion |
| 2:45 PM PT | Hotfix deployed to production |
| 2:50 PM PT | Error rates returning to normal |
| 3:15 PM PT | Incident resolved |
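The Duration field in the header is derived from timestamps like these. A minimal sketch of the calculation (the `incident_duration` helper is illustrative, not a Runframe API):

```python
from datetime import datetime

def incident_duration(timeline: list[tuple[str, str]]) -> int:
    """Return minutes between the first (detection) and last (resolution)
    timeline entries. Timestamps use the '2:15 PM' style from the table."""
    fmt = "%I:%M %p"
    start = datetime.strptime(timeline[0][0], fmt)
    end = datetime.strptime(timeline[-1][0], fmt)
    return int((end - start).total_seconds() // 60)

timeline = [
    ("2:15 PM", "Incident detected via Datadog alert"),
    ("3:15 PM", "Incident resolved"),
]
print(incident_duration(timeline))  # 60
```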
Root cause
The technical explanation of what went wrong. Be specific:
- What code or configuration change caused this?
- What was the cascade of failures?
- What monitoring was missing?
- What made this hard to detect or fix?
Resolution and recovery
How you fixed it:
- What was the immediate fix?
- What was the permanent fix?
- How long did each step take?
- What was the rollback plan if the fix didn’t work?
Follow-up actions
Specific, actionable items to prevent recurrence. Each action should have:
| Action | Owner | Due date | Status |
|---|---|---|---|
| Add monitoring for connection pool usage | @alice | 2026-01-15 | Open |
| Increase connection pool size from 10 to 50 | @bob | 2026-01-10 | Complete |
| Run load tests before database migrations | @carol | 2026-01-20 | Open |
Lessons learned
What went well and what didn’t:
What went well:
- On-call responded within SLA
- Hotfix was deployed quickly
- Communication to stakeholders was clear
What could be improved:
- Monitoring should have detected connection pool growth earlier
- Runbook for database issues didn’t cover this scenario
- Escalation to database specialist was delayed
Be blameless
Postmortems focus on systems and processes, not individuals. Never blame specific people. Instead of “@alice forgot to update the config,” write “The deployment process didn’t validate config changes.”
Writing a great postmortem
Do’s
- Be specific – Exact times, specific services, precise error messages
- Be concise – Most postmortems should be under 1,000 words
- Include data – Metrics, screenshots, graphs from monitoring tools
- Assign actions – Every follow-up should have an owner and due date
- Share learnings – Distribute to the wider team, not just participants
Don’ts
- Don’t assign blame – Systems fail, not people
- Don’t be vague – “We’ll be more careful next time” is not a follow-up action
- Don’t bury the lede – Put the most important information in the summary
- Don’t forget the customer – Even internal incidents affect someone
Postmortem review process
After drafting:
- Technical review – Have a senior engineer verify root cause analysis
- Comms review – Ensure customer impact is accurate if sharing externally
- Action item review – Verify all actions are specific and assigned
Managing follow-up actions
Runframe tracks follow-up actions from postmortems to ensure completion.
Creating actions
From the postmortem editor, add follow-up actions:
- Description – What needs to be done
- Owner – Who is responsible (assign to a user)
- Due date – When it should be completed
- Priority – P0 through P4
Tracking actions
View all open actions from:
- Postmortems → Action Items in the dashboard
- Individual postmortem pages
- Slack notifications for overdue items
Closing actions
When an action is complete:
- Navigate to the postmortem
- Mark the action as Complete
- Add a brief note explaining what was done
- Runframe records the completion timestamp
Overdue actions
Runframe notifies:
- The action owner when due date approaches
- The postmortem author when actions are overdue
- Team managers for critical overdue actions
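The notification rules above can be sketched as a simple check. This is an illustrative model only; the field names and the two-day warning window are assumptions, not Runframe's actual schema:

```python
from datetime import date, timedelta

def overdue_recipients(action: dict, today: date) -> list[str]:
    """Who to notify about a follow-up action, per the rules above:
    owner when the due date approaches, postmortem author when overdue,
    and the team manager for critical (P0/P1) overdue actions."""
    recipients = []
    if today >= action["due"] - timedelta(days=2):  # due date approaching
        recipients.append(action["owner"])
    if today > action["due"]:                       # overdue
        recipients.append(action["pm_author"])
        if action["priority"] in ("P0", "P1"):      # critical
            recipients.append(action["manager"])
    return recipients

action = {"due": date(2026, 1, 10), "owner": "@alice",
          "pm_author": "@bob", "priority": "P0", "manager": "@carol"}
print(overdue_recipients(action, date(2026, 1, 12)))
# ['@alice', '@bob', '@carol']
```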
Action items create a feedback loop
Completing postmortem actions reduces future incidents. Track action completion rate as a metric of incident response effectiveness.
Postmortem culture
A healthy postmortem culture accelerates learning and builds trust.
Blameless postmortems
Blameless doesn’t mean no accountability—it means:
- Accountability to systems – Did our processes, tooling, or training fail?
- Psychological safety – People feel safe sharing mistakes without fear of retribution
- Focus on improvement – Energy goes into preventing recurrence, not assigning fault
When something goes wrong, ask:
- “What system failed?” not “Who messed up?”
- “What process allowed this?” not “Why didn’t you catch this?”
- “How do we prevent this?” not “Who’s responsible?”
Making time for postmortems
Postmortems take time away from feature work, but pay dividends in reduced incidents:
- Schedule postmortem time – Block 1 to 2 hours within 48 hours of resolution
- Write while fresh – Details fade quickly; write soon after incidents
- Collaborative drafting – Two heads are better than one; write together
Sharing postmortems
Decide how broadly to share based on severity and customer impact:
| Severity | Internal sharing | External sharing |
|---|---|---|
| P0 | All-hands, executive summary | Public blog post, customer notification |
| P1 | Engineering-wide, relevant teams | Customer notification if impact warrants |
| P2 | Team-specific, postmortem channel | Typically internal only |
| P3-P4 | Team-specific | Internal only |
Postmortem analytics
Track postmortem metrics to improve your incident response:
| Metric | Definition | Goal |
|---|---|---|
| Postmortem completion rate | Share of P0-P1 incidents with a completed postmortem | 100% |
| Time to postmortem | Days from incident resolution to postmortem completion | Under 7 days |
| Action completion rate | Share of follow-up actions completed within 30 days | Above 80% |
| Recurrence rate | Share of incidents caused by the same root cause within 90 days | Under 5% |
View metrics in the Analytics section of the dashboard.
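As an illustration of how one of these metrics is computed, here is a sketch of the action completion rate. The record shape is an assumption for the example; Runframe computes this for you in the Analytics section:

```python
from datetime import date

# Hypothetical action records: when each was created and completed.
actions = [
    {"created": date(2026, 1, 1), "completed": date(2026, 1, 14)},
    {"created": date(2026, 1, 1), "completed": date(2026, 1, 30)},
    {"created": date(2026, 1, 1), "completed": None},  # still open
]

def completion_rate(actions: list[dict], window_days: int = 30) -> float:
    """Share of follow-up actions completed within `window_days` of creation."""
    done = sum(
        1 for a in actions
        if a["completed"] is not None
        and (a["completed"] - a["created"]).days <= window_days
    )
    return done / len(actions)

print(f"{completion_rate(actions):.0%}")  # 67%, below the 80% goal
```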
Examples
Example postmortem: Database connection pool exhaustion
# PM-001: Database connection pool exhaustion
**Incident:** INC-042
**Severity:** P1
**Date:** January 15, 2026
**Duration:** 58 minutes
**Customer impact:** Yes – checkout flows failing
## Summary
A database connection pool exhaustion caused 500 errors on checkout flows for 58 minutes. Root cause was a recent deployment that increased concurrent query capacity without adjusting connection pool sizing. Fixed by raising the database connection limit, right-sizing per-worker connection pools, and adding monitoring for pool usage.
## Impact
- Checkout flows failed with 500 errors
- Approximately 1,200 users affected
- Estimated revenue impact: $15,000
- No data loss or corruption
## Timeline
| Time | Event |
|------|-------|
| 2:15 PM PT | Datadog alert: 500 error rate spike on checkout endpoint |
| 2:17 PM PT | On-call engineer paged via Slack DM |
| 2:22 PM PT | Incident acknowledged, investigating logs |
| 2:35 PM PT | Root cause identified: database connection pool exhaustion |
| 2:40 PM PT | Hotfix deployed: database connection limit raised to 500, per-worker pool size reduced from 10 to 5 |
| 2:45 PM PT | Error rates returning to normal |
| 3:13 PM PT | Incident resolved |
## Root cause
The deployment of v2.3.0 increased the number of concurrent worker processes from 5 to 20 to handle increased traffic. Each worker process maintains its own database connection pool of 10 connections, resulting in 200 total connections. The database connection limit was configured at 150 connections, causing 50 connections to be rejected.
The deployment process did not validate that connection pool sizing matched the new worker process count. Additionally, no monitoring existed for connection pool usage.
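The capacity check the deployment process was missing amounts to simple arithmetic. A sketch using the numbers from this incident (the function name is illustrative):

```python
def total_connections(workers: int, pool_size_per_worker: int) -> int:
    """Total database connections: each worker maintains its own pool."""
    return workers * pool_size_per_worker

db_connection_limit = 150

# Before v2.3.0: 5 workers x 10 connections = 50, well under the limit.
print(total_connections(5, 10))   # 50

# After v2.3.0: 20 workers x 10 connections = 200, exceeding the
# 150-connection limit, so 50 connection attempts were rejected.
print(total_connections(20, 10))  # 200
```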
## Resolution and recovery
**Immediate fix:** Increased database connection limit from 150 to 500 and reduced worker process connection pool size from 10 to 5, allowing 100 total connections with headroom.
**Permanent fix:**
- Connection pool sizing now calculated automatically based on worker count
- Added monitoring alert for connection pool usage above 70%
- Added pre-deployment checklist item to validate connection limits
**Rollback plan:** If fix didn't work, would have reverted to v2.2.0 with 5 worker processes.
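The 70% monitoring alert from the permanent fix can be sketched as follows. The function name and alerting wiring are illustrative, not the actual Datadog configuration:

```python
def pool_usage_alert(active_connections: int, connection_limit: int,
                     threshold: float = 0.70) -> bool:
    """Fire when connection pool usage exceeds the alert threshold."""
    return active_connections / connection_limit > threshold

# Post-fix sizing: 20 workers x 5 connections = 100, limit raised to 500.
print(pool_usage_alert(100, 500))  # False: 20% usage, comfortable headroom
print(pool_usage_alert(400, 500))  # True: 80% usage would trigger the alert
```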
## Follow-up actions
| Action | Owner | Due date | Status |
|--------|-------|----------|--------|
| Add Datadog monitoring for connection pool usage | @alice | 2026-01-17 | Complete |
| Update deployment runbook with connection pool sizing | @bob | 2026-01-18 | Complete |
| Add connection pool check to pre-deployment checklist | @carol | 2026-01-20 | Open |
| Load test database with 2x traffic before next worker scale | @dave | 2026-01-25 | Open |
## Lessons learned
**What went well:**
- On-call acknowledged within 5 minutes
- Root cause identified quickly due to clear error logs
- Hotfix deployed and verified within 30 minutes
**What could be improved:**
- Monitoring should have detected connection pool growth before exhaustion
- Deployment process should validate infrastructure changes
- Runbook didn't cover database connection issues
Need more?
- Incidents – Incident lifecycle and severity
- Slash Commands – Complete /inc command reference
- On-Call – Scheduling and rotations
- Web Dashboard – Full postmortem management UI