Incidents
Learn how Runframe manages incidents from detection to postmortem.
Overview
An incident is an unplanned disruption or degradation of service that affects your users. Runframe provides a complete incident management system that helps your team respond quickly, communicate effectively, and learn from every incident.
What you get with every incident
| Feature | Benefit |
|---|---|
| Dedicated Slack channel | Focused coordination without noise |
| Automatic assignments | Right people notified immediately |
| SLA tracking | Clear deadlines for acknowledgment and resolution |
| Status updates | Everyone stays informed automatically |
| Escalation safeguards | Incidents never fall through the cracks |
| Postmortem workflow | Capture learnings while they’re fresh |
Incident lifecycle
Every incident in Runframe follows a structured lifecycle:
| Stage | Status | What happens |
|---|---|---|
| Detection | — | Incident identified, pager triggered |
| Acknowledged | investigating | Responder assigned, investigating root cause |
| Identified | identified | Root cause known, working on fix |
| Monitoring | monitoring | Fix deployed, watching stability |
| Resolved | resolved | Incident over, normal operation restored |
| Closed | — | Channel archived, documentation complete |
Status changes
Update incident status with Slack commands or the web dashboard:
/inc status investigating
/inc status identified
/inc status monitoring
/inc status resolved
Each status update:
- Posts to the incident channel
- Updates the dashboard incident detail page
- Records timestamps for MTTA (Mean Time to Acknowledge) and MTTR (Mean Time to Resolve) analytics
Use /inc resolve for the complete workflow
/inc status resolved updates the status but doesn’t complete the incident workflow. Use /inc resolve to close the incident properly and generate postmortem prompts.
Severity levels
Runframe uses five severity levels to classify incidents. Severity determines:
- SLA deadlines for acknowledgment and resolution
- Escalation timing and paths
- Who gets notified and how (Slack DM, SMS, phone call)
- Whether a postmortem is required
Severity definitions
| Severity | Name | Definition | Typical SLA | Postmortem |
|---|---|---|---|---|
| P0 | Critical | Complete service outage or data loss | 5 min / 30 min | Required |
| P1 | High | Major feature broken, significant user impact | 15 min / 1 hour | Required |
| P2 | Medium | Degraded performance, partial user impact | 30 min / 4 hours | Recommended |
| P3 | Low | Minor issue, workaround exists | 1 hour / 1 day | Optional |
| P4 | Trivial | Cosmetic issue, no user impact | 4 hours / 3 days | Skip |
Choosing severity
Avoid severity fatigue
Reserve P0 and P1 for true emergencies. Overusing high severities leads to alert fatigue and slower response when it really matters.
P0 (Critical)
- Complete service outage
- Data loss or corruption
- Security breach
- Complete loss of authentication
P1 (High)
- Major feature completely broken
- Significant performance degradation affecting all users
- Inability to process payments or critical transactions
- No workaround available
P2 (Medium)
- Degraded performance for some users
- Feature partially broken but workaround exists
- Increased error rates but core functionality works
- Single region or service impact
P3 (Low)
- Minor bug affecting few users
- Performance impact that doesn’t block workflows
- Admin or internal tool broken
- Clear workaround available
P4 (Trivial)
- Typo or cosmetic issue
- Documentation error
- Feature request disguised as an incident
- No functional impact
Changing severity
Adjust severity as you learn more:
/inc severity p0
/inc severity p1
/inc severity p2
/inc severity p3
/inc severity p4
Severity changes update SLAs
When you change severity, Runframe recalculates acknowledgment and resolution deadlines. Escalation policies may be re-evaluated.
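To illustrate how deadlines move when severity changes, here is a minimal sketch built on the "Typical SLA" column above. It assumes each pair reads as acknowledgment / resolution and that both deadlines are measured from incident creation. The names `TYPICAL_SLA` and `sla_deadlines` are hypothetical and not part of Runframe.

```python
from datetime import datetime, timedelta, timezone

# "Typical SLA" values from the severity table, read as
# (acknowledgment, resolution). That ordering is an assumption.
TYPICAL_SLA = {
    "p0": (timedelta(minutes=5), timedelta(minutes=30)),
    "p1": (timedelta(minutes=15), timedelta(hours=1)),
    "p2": (timedelta(minutes=30), timedelta(hours=4)),
    "p3": (timedelta(hours=1), timedelta(days=1)),
    "p4": (timedelta(hours=4), timedelta(days=3)),
}


def sla_deadlines(created_at, severity):
    """Return (ack_deadline, resolve_deadline) for an incident.

    Assumption: both deadlines are counted from incident creation.
    """
    ack, resolve = TYPICAL_SLA[severity.lower()]
    return created_at + ack, created_at + resolve


created = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

# Incident opened as P2, then raised to P0: both deadlines move earlier.
p2_ack, p2_resolve = sla_deadlines(created, "p2")
p0_ack, p0_resolve = sla_deadlines(created, "p0")
print("P2:", p2_ack.time(), p2_resolve.time())
print("P0:", p0_ack.time(), p0_resolve.time())
```

Raising severity late in an incident can leave the new deadline already in the past, so it is worth checking the SLA countdown right after a change.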
SLAs and escalation
Runframe automatically tracks SLA deadlines for every incident based on severity.
SLA deadlines
| SLA Type | Definition |
|---|---|
| Acknowledgment | Time to first responder assignment |
| Resolution | Time to incident resolved status |
SLA countdown
Every incident channel displays a live SLA countdown:
- Green: On track, more than 50% of SLA remaining
- Yellow: Approaching deadline, less than 50% of SLA remaining
- Red: SLA breached or imminent
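The color rules above can be sketched as a small function. This is an illustration, not Runframe's implementation: the point at which a deadline counts as "imminent" is not stated here, so the sketch assumes 10% of the SLA remaining (the hypothetical `imminent_fraction` parameter) and treats exactly 50% remaining as yellow.

```python
def sla_color(elapsed_seconds, sla_seconds, imminent_fraction=0.1):
    """Classify an SLA countdown as "green", "yellow", or "red".

    Assumptions: "imminent" means at most `imminent_fraction` of the SLA
    remains, and exactly 50% remaining counts as yellow.
    """
    remaining = (sla_seconds - elapsed_seconds) / sla_seconds
    if remaining <= imminent_fraction:
        return "red"  # breached (nothing left) or imminent
    if remaining > 0.5:
        return "green"
    return "yellow"


# A 5-minute (300 s) acknowledgment SLA at different points in time.
for elapsed in (60, 200, 290, 400):
    print(elapsed, sla_color(elapsed, 300))
```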
Automatic escalation
When SLAs are breached, Runframe triggers escalation policies:
- First escalation - Notify incident lead or manager
- Second escalation - Page backup on-call or escalate to wider team
- Executive escalation - For P0/P1 incidents exceeding resolution SLA
Configure escalation policies in Settings. For details, see /guides/escalations.
Incident roles
For larger incidents, assign specific roles to clarify responsibilities:
/inc assign @username --role lead
/inc assign @username --role comms
/inc assign @username --role operations
| Role | Responsibility | Skills |
|---|---|---|
| Incident Lead | Overall coordination, status updates, escalation decisions | Communication, decisiveness |
| Comms Lead | Customer notifications, stakeholder updates, public statements | Clear writing, customer empathy |
| Operations Lead | Technical investigation, fix deployment, verification | Technical depth, system knowledge |
Role assignment is optional
For smaller incidents, a single responder often handles all responsibilities. Use roles for complex or high-severity incidents.
Customer impact
Track whether an incident affects customers:
- Customer-impacting incidents get higher visibility in dashboards
- SLAs may differ for customer-impacting vs. internal incidents
- Communication templates differ based on customer impact
Mark incidents as customer-impacting when creating or editing from the dashboard.
Related incidents
Link related incidents to track duplicates, dependencies, or recurring issues:
| Relationship | When to use |
|---|---|
| Duplicate | Multiple reports of the same issue |
| Depends on | This incident is blocked by another |
| Blocks | This incident prevents another from being resolved |
| Related to | Loose connection, worth noting |
| Caused by | This incident is a consequence of another |
| Root cause of | This incident led to another |
Link related incidents from the dashboard incident detail page.
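Several of these relationships are two sides of the same link: "Depends on" mirrors "Blocks", and "Caused by" mirrors "Root cause of". The sketch below shows one way to model that pairing. Whether Runframe creates the reverse link automatically is not stated here, and the `INVERSE` table, `link` function, and incident IDs are hypothetical.

```python
# Each relationship mapped to the one seen from the other incident.
# "duplicate" and "related_to" are treated as symmetric (an assumption).
INVERSE = {
    "duplicate": "duplicate",
    "related_to": "related_to",
    "depends_on": "blocks",
    "blocks": "depends_on",
    "caused_by": "root_cause_of",
    "root_cause_of": "caused_by",
}


def link(links, source, relationship, target):
    """Record a link and its mirror image so both incidents show it."""
    links.add((source, relationship, target))
    links.add((target, INVERSE[relationship], source))
    return links


links = set()
link(links, "INC-101", "caused_by", "INC-100")
for entry in sorted(links):
    print(entry)
```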
Communication best practices
Good incident communication reduces stress and builds trust.
During the incident
Do:
- Update regularly – Even “still investigating” manages expectations
- Be transparent – Share what you know and what you don’t
- Provide ETAs – Even if it’s “no ETA yet, next update in 30 minutes”
- Use threads for side discussions – Keep the main channel focused on status
- Escalate early if stuck – Use /inc page or escalate manually
Don’t:
- Don’t speculate – Only report confirmed information
- Don’t blame – Focus on the system, not individuals
- Don’t hide bad news – Earlier is better
- Don’t go silent – Silence causes more anxiety than bad news
Status update template
/inc status identified
Root cause: Database connection pool exhaustion causing 500 errors.
Impact: Checkout flows failing with 500 errors.
Fix: Increasing connection pool size and deploying patch.
ETA: 15 minutes to deployment, 30 minutes to verify stability.
Resolution summary template
When resolving incidents, provide context for postmortems:
Fixed database connection pool leak. Deployed patch v2.3.1.
Connection pool size increased from 10 to 50. Added monitoring for pool exhaustion.
Error rates back to normal. Monitoring for 24 hours.
Analytics and insights
Runframe tracks incident metrics to help you improve:
MTTA and MTTR
| Metric | Full Name | Definition |
|---|---|---|
| MTTA | Mean Time to Acknowledge | Average time from incident creation to first responder assignment |
| MTTR | Mean Time to Resolve | Average time from incident creation to resolved status |
View trends in the Analytics section of the dashboard.
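As a worked example of the two definitions, the sketch below averages the time from creation to acknowledgment and from creation to resolution. The field names (`created_at`, `acknowledged_at`, `resolved_at`) and the choice to skip incidents that have not reached a milestone yet are assumptions for illustration, not Runframe's data model.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_minutes(incidents, end_field):
    """Average minutes from `created_at` to `end_field`.

    Incidents without that timestamp are skipped (an assumption).
    Returns None when no incident qualifies.
    """
    durations = [
        (inc[end_field] - inc["created_at"]).total_seconds() / 60
        for inc in incidents
        if inc.get(end_field) is not None
    ]
    return mean(durations) if durations else None


t0 = datetime(2024, 1, 1, 9, 0)
incidents = [
    {"created_at": t0,
     "acknowledged_at": t0 + timedelta(minutes=4),
     "resolved_at": t0 + timedelta(minutes=40)},
    {"created_at": t0,
     "acknowledged_at": t0 + timedelta(minutes=10),
     "resolved_at": t0 + timedelta(minutes=80)},
    {"created_at": t0,
     "acknowledged_at": t0 + timedelta(minutes=1),
     "resolved_at": None},  # still open, so excluded from MTTR
]

mtta = mean_minutes(incidents, "acknowledged_at")  # (4 + 10 + 1) / 3 = 5.0
mttr = mean_minutes(incidents, "resolved_at")      # (40 + 80) / 2 = 60.0
print(f"MTTA: {mtta} min, MTTR: {mttr} min")
```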
Incident frequency
Track incidents over time by severity, service, or team to identify patterns and systemic issues.
Need more?
- Slash Commands – Complete /inc command reference
- On-Call – Scheduling and rotations
- Postmortems – Learning from incidents
- Escalations – Configure escalation policies
- Web Dashboard – Full incident management UI