Incidents

Learn how Runframe manages incidents from detection to postmortem.

Overview

An incident is an unplanned disruption or degradation of service that affects your users. Runframe provides a complete incident management system that helps your team respond quickly, communicate effectively, and learn from every incident.

What you get with every incident

Feature	Benefit
Dedicated Slack channel	Focused coordination without noise
Automatic assignments	Right people notified immediately
SLA tracking	Clear deadlines for acknowledgment and resolution
Status updates	Everyone stays informed automatically
Escalation safeguards	Incidents never fall through the cracks
Postmortem workflow	Capture learnings while they’re fresh

Incident lifecycle

Every incident in Runframe follows a structured lifecycle:

Stage	Status	What happens
Detection	—	Incident identified, pager triggered
Acknowledged	`investigating`	Responder assigned, investigating root cause
Identified	`identified`	Root cause known, working on fix
Monitoring	`monitoring`	Fix deployed, watching stability
Resolved	`resolved`	Incident over, normal operation restored
Closed	—	Channel archived, documentation complete

Status changes

Update incident status with Slack commands or the web dashboard:

/inc move investigating
/inc move identified
/inc move monitoring
/inc move resolved

Each status update:

Posts to the incident channel
Updates the dashboard incident detail page
Records timestamps for MTTA (Mean Time to Acknowledge) and MTTR (Mean Time to Resolve) analytics
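The timestamp recording described above can be pictured with a small model. This is a minimal sketch, not Runframe's implementation: the `Incident` class, the `move` function, and the choice to treat the first `investigating` timestamp as the acknowledgment time are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional

# The four statuses accepted by `/inc move` on this page.
STATUSES = ("investigating", "identified", "monitoring", "resolved")


@dataclass
class Incident:
    """Hypothetical in-memory record, not Runframe's actual data model."""
    created_at: datetime
    status: Optional[str] = None  # None until a responder acknowledges
    status_times: Dict[str, datetime] = field(default_factory=dict)


def move(incident: Incident, status: str, at: datetime) -> None:
    """Apply a status change and remember when each status was first reached."""
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    incident.status = status
    # Keep the first timestamp per status so time-to-acknowledge and
    # time-to-resolve are not overwritten by repeated moves.
    incident.status_times.setdefault(status, at)


created = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
incident = Incident(created_at=created)
move(incident, "investigating", created + timedelta(minutes=4))
move(incident, "resolved", created + timedelta(minutes=50))

# Per-incident durations that feed the MTTA and MTTR averages.
time_to_ack = incident.status_times["investigating"] - incident.created_at
time_to_resolve = incident.status_times["resolved"] - incident.created_at
```

Storing only the first timestamp per status is a design choice in this sketch; a real system might keep the full history of status changes instead.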

Use /inc resolve for the complete workflow

/inc move resolved updates the status but doesn’t complete the incident workflow. Use /inc resolve to close the incident properly and generate postmortem prompts.

Severity levels

Runframe uses five severity levels to classify incidents. Severity determines:

SLA deadlines for acknowledgment and resolution
Escalation timing and paths
Who gets notified and how (Slack DM, SMS, email)
Whether a postmortem is required

Severity definitions

Name	Also Known As	Definition	Typical SLA (acknowledge / resolve)	Postmortem
Critical (SEV0)	P0	Complete service outage or data loss	5 min / 30 min	Required
High (SEV1)	P1	Major feature broken, significant user impact	15 min / 1 hour	Required
Medium (SEV2)	P2	Degraded performance, partial user impact	30 min / 4 hours	Recommended
Low (SEV3)	P3	Minor issue, workaround exists	1 hour / 1 day	Optional
Pre-emptive (SEV4)	P4	Cosmetic issue, no user impact	4 hours / 3 days	Skip
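To make the SLA column concrete, the sketch below turns the typical values from the table into absolute deadlines. It assumes both clocks start at incident creation, which matches how MTTA and MTTR are defined on this page. The names `TYPICAL_SLA` and `sla_deadlines` are hypothetical, and your workspace's actual SLA values may differ from the typical ones.

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple, Tuple


class Sla(NamedTuple):
    acknowledge: timedelta
    resolve: timedelta


# "Typical SLA" values copied from the severity table above.
TYPICAL_SLA = {
    "SEV0": Sla(timedelta(minutes=5), timedelta(minutes=30)),
    "SEV1": Sla(timedelta(minutes=15), timedelta(hours=1)),
    "SEV2": Sla(timedelta(minutes=30), timedelta(hours=4)),
    "SEV3": Sla(timedelta(hours=1), timedelta(days=1)),
    "SEV4": Sla(timedelta(hours=4), timedelta(days=3)),
}


def sla_deadlines(severity: str, created_at: datetime) -> Tuple[datetime, datetime]:
    """Return (acknowledge_by, resolve_by), both measured from creation time."""
    sla = TYPICAL_SLA[severity]
    return created_at + sla.acknowledge, created_at + sla.resolve


created = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
ack_by, resolve_by = sla_deadlines("SEV1", created)
print(ack_by.isoformat())      # 2024-01-01T09:15:00+00:00
print(resolve_by.isoformat())  # 2024-01-01T10:00:00+00:00
```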

Choosing severity

Avoid severity fatigue

Reserve Critical and High for true emergencies. Overusing high severities leads to alert fatigue and slower response when it really matters.

Critical (SEV0 / P0)

Complete service outage
Data loss or corruption
Security breach
Complete loss of authentication

High (SEV1 / P1)

Major feature completely broken
Significant performance degradation affecting all users
Inability to process payments or critical transactions
No workaround available

Medium (SEV2 / P2)

Degraded performance for some users
Feature partially broken but workaround exists
Increased error rates but core functionality works
Single region or service impact

Low (SEV3 / P3)

Minor bug affecting few users
Performance impact that doesn’t block workflows
Admin or internal tool broken
Clear workaround available

Pre-emptive (SEV4 / P4)

Typo or cosmetic issue
Documentation error
Feature request disguised as incident
No functional impact

Changing severity

As you learn more, adjust severity using friendly names:

/inc severity critical
/inc severity high
/inc severity medium
/inc severity low
/inc severity pre-emptive

Or use SEV notation:

/inc severity SEV0
/inc severity SEV1
/inc severity SEV2
/inc severity SEV3
/inc severity SEV4
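If you script around these commands, it can help to map both notations to a single canonical form. The sketch below is one way to do that. `normalize_severity` is a hypothetical helper, and the case-insensitive matching is an assumption, not documented Runframe behavior.

```python
# Friendly names and their SEV equivalents, as listed on this page.
FRIENDLY_TO_SEV = {
    "critical": "SEV0",
    "high": "SEV1",
    "medium": "SEV2",
    "low": "SEV3",
    "pre-emptive": "SEV4",
}


def normalize_severity(value: str) -> str:
    """Map a friendly name or SEV notation to canonical SEV notation."""
    token = value.strip()
    if token.upper() in FRIENDLY_TO_SEV.values():
        return token.upper()
    try:
        return FRIENDLY_TO_SEV[token.lower()]
    except KeyError:
        raise ValueError(f"unknown severity: {value!r}") from None


print(normalize_severity("critical"))     # SEV0
print(normalize_severity("SEV3"))         # SEV3
print(normalize_severity("pre-emptive"))  # SEV4
```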

Severity changes update SLAs

When you change severity, Runframe recalculates acknowledgment and resolution deadlines. Escalation policies may be re-evaluated.

SLAs and escalation

Runframe automatically tracks SLA deadlines for every incident based on severity.

SLA deadlines

SLA Type	Definition
Acknowledgment	Time from incident creation to first responder assignment
Resolution	Time from incident creation to resolved status

SLA countdown

Every incident channel displays a live SLA countdown:

Green: On track, more than 50% of SLA remaining
Yellow: Approaching deadline, less than 50% of SLA remaining
Red: SLA breached or breach imminent
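The color rules above can be expressed as a small function. This is a sketch under stated assumptions: the page does not say how close to the deadline counts as "imminent", so the `imminent_fraction` parameter (10% of the SLA remaining by default) is a placeholder, and treating exactly 50% remaining as yellow is likewise a guess.

```python
from datetime import timedelta


def sla_color(elapsed: timedelta, sla: timedelta,
              imminent_fraction: float = 0.10) -> str:
    """Classify an SLA countdown as 'green', 'yellow', or 'red'."""
    if sla <= timedelta(0):
        raise ValueError("sla must be positive")
    # Dividing two timedeltas yields a float; negative means breached.
    remaining = 1 - elapsed / sla
    if remaining <= imminent_fraction:  # breached, or assumed "imminent"
        return "red"
    if remaining > 0.5:
        return "green"
    return "yellow"


sla = timedelta(minutes=30)
print(sla_color(timedelta(minutes=10), sla))  # green
print(sla_color(timedelta(minutes=20), sla))  # yellow
print(sla_color(timedelta(minutes=35), sla))  # red
```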

Automatic escalation

When SLAs are breached, Runframe triggers escalation policies:

First escalation - Notify incident lead or manager
Second escalation - Page backup on-call or escalate to wider team
Executive escalation - For Critical/High incidents exceeding resolution SLA
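One reading of the list above, written as code: the first and second escalations apply to any breached SLA, while executive escalation is reserved for Critical (SEV0) or High (SEV1) incidents that exceed their resolution SLA. The function name, tier labels, and the assumption that all tiers are returned at once are illustrative. Actual timing and paths come from your configured escalation policies.

```python
from typing import List


def escalation_tiers(severity: str, breached_sla: str) -> List[str]:
    """Return the escalation tiers that can apply once an SLA is breached.

    breached_sla is either "acknowledgment" or "resolution".
    """
    if breached_sla not in ("acknowledgment", "resolution"):
        raise ValueError(f"unknown SLA type: {breached_sla!r}")
    tiers = ["first", "second"]
    # Executive escalation: Critical/High incidents past the resolution SLA.
    if severity in ("SEV0", "SEV1") and breached_sla == "resolution":
        tiers.append("executive")
    return tiers


print(escalation_tiers("SEV0", "resolution"))      # ['first', 'second', 'executive']
print(escalation_tiers("SEV2", "resolution"))      # ['first', 'second']
print(escalation_tiers("SEV1", "acknowledgment"))  # ['first', 'second']
```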

Configure escalation policies in Settings, or see /guides/escalations for details.

Incident roles

For larger incidents, assign specific roles to clarify responsibilities:

/inc assign @username --role lead
/inc assign @username --role comms
/inc assign @username --role operations

Role	Responsibility	Skills
Incident Lead	Overall coordination, status updates, escalation decisions	Communication, decisiveness
Comms Lead	Customer notifications, stakeholder updates, public statements	Clear writing, customer empathy
Operations Lead	Technical investigation, fix deployment, verification	Technical depth, system knowledge

Role assignment is optional

For smaller incidents, a single responder often handles all responsibilities. Use roles for complex or high-severity incidents.

Customer impact

Track whether an incident affects customers:

Customer-impacting incidents get higher visibility in dashboards
SLAs may differ for customer-impacting vs. internal incidents
Communication templates differ based on customer impact

Mark an incident as customer-impacting when you create or edit it in the dashboard.

Related incidents

Link related incidents to track duplicates, dependencies, or recurring issues:

Relationship	When to use
Duplicate	Multiple reports of the same issue
Depends on	This incident is blocked by another
Blocks	This incident prevents another from being resolved
Related to	Loose connection, worth noting
Caused by	This incident is a consequence of another
Root cause of	This incident led to another
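Several of these relationships come in pairs: "Depends on" mirrors "Blocks", and "Caused by" mirrors "Root cause of". The sketch below models that pairing. Whether Runframe creates the inverse link automatically is not stated here, so the `link` helper, the relationship keys, and the incident IDs are hypothetical.

```python
from typing import Dict, List, Tuple

# Each relationship and the relationship seen from the other incident.
INVERSE = {
    "duplicate": "duplicate",
    "related_to": "related_to",
    "depends_on": "blocks",
    "blocks": "depends_on",
    "caused_by": "root_cause_of",
    "root_cause_of": "caused_by",
}

Links = Dict[str, List[Tuple[str, str]]]


def link(links: Links, source: str, relationship: str, target: str) -> None:
    """Record a relationship on the source and its inverse on the target."""
    inverse = INVERSE[relationship]  # KeyError for unknown relationships
    links.setdefault(source, []).append((relationship, target))
    links.setdefault(target, []).append((inverse, source))


links: Links = {}
link(links, "INC-2", "caused_by", "INC-1")
print(links["INC-2"])  # [('caused_by', 'INC-1')]
print(links["INC-1"])  # [('root_cause_of', 'INC-2')]
```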

Link related incidents from the dashboard incident detail page.

Communication best practices

Good incident communication reduces stress and builds trust.

During the incident

Do:

Update regularly – Even “still investigating” manages expectations
Be transparent – Share what you know and what you don’t
Provide ETAs – Even if it’s “no ETA yet, next update in 30 minutes”
Use threads for side discussions – Keep the main channel focused on status
Escalate early if stuck – Use /inc page or escalate manually

Don’t:

Don’t speculate – Only report confirmed information
Don’t blame – Focus on the system, not individuals
Don’t hide bad news – Earlier is better
Don’t go silent – Silence causes more anxiety than bad news

Status update template

/inc move identified

Root cause: Database connection pool exhaustion causing 500 errors.
Impact: Checkout flows failing with 500 errors.
Fix: Increasing connection pool size and deploying patch.
ETA: 15 minutes to deployment, 30 minutes to verify stability.

Resolution summary template

When resolving incidents, provide context for postmortems:

Fixed database connection pool leak. Deployed patch v2.3.1.
Connection pool size increased from 10 to 50. Added monitoring for pool exhaustion.
Error rates back to normal. Monitoring for 24 hours.

Analytics and insights

Runframe tracks incident metrics to help you improve:

MTTA and MTTR

Metric	Full Name	Definition
MTTA	Mean Time to Acknowledge	Average time from incident creation to first responder assignment
MTTR	Mean Time to Resolve	Average time from incident creation to resolved status
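As a worked example of these definitions, the sketch below averages per-incident durations measured from creation time. The `IncidentTimes` record and the sample timestamps are made up for illustration and do not come from Runframe.

```python
from datetime import datetime, timedelta, timezone
from typing import List, NamedTuple


class IncidentTimes(NamedTuple):
    created_at: datetime
    acknowledged_at: datetime  # first responder assignment
    resolved_at: datetime      # incident reached resolved status


def mean_delta(deltas: List[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    if not deltas:
        raise ValueError("need at least one incident")
    return sum(deltas, timedelta(0)) / len(deltas)


def mtta(incidents: List[IncidentTimes]) -> timedelta:
    return mean_delta([i.acknowledged_at - i.created_at for i in incidents])


def mttr(incidents: List[IncidentTimes]) -> timedelta:
    return mean_delta([i.resolved_at - i.created_at for i in incidents])


t0 = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
sample = [
    IncidentTimes(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=30)),
    IncidentTimes(t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=90)),
]
print(mtta(sample))  # 0:07:00
print(mttr(sample))  # 1:00:00
```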

View trends in the Analytics section of the dashboard.

Incident frequency

Track incidents over time by severity, service, or team to identify patterns and systemic issues.

Need more?

Slash Commands – Complete /inc command reference
On-Call – Scheduling and rotations
Postmortems – Learning from incidents
Escalations – Configure escalation policies
Web Dashboard – Full incident management UI