
Incidents

Learn how Runframe manages incidents from detection to postmortem.


Overview

An incident is an unplanned disruption or degradation of service that affects your users. Runframe provides a complete incident management system that helps your team respond quickly, communicate effectively, and learn from every incident.

What you get with every incident

| Feature | Benefit |
| --- | --- |
| Dedicated Slack channel | Focused coordination without noise |
| Automatic assignments | Right people notified immediately |
| SLA tracking | Clear deadlines for acknowledgment and resolution |
| Status updates | Everyone stays informed automatically |
| Escalation safeguards | Incidents never fall through the cracks |
| Postmortem workflow | Capture learnings while they’re fresh |

Incident lifecycle

Every incident in Runframe follows a structured lifecycle:

| Stage | Status | What happens |
| --- | --- | --- |
| Detection | | Incident identified, pager triggered |
| Acknowledged | investigating | Responder assigned, investigating root cause |
| Identified | identified | Root cause known, working on fix |
| Monitoring | monitoring | Fix deployed, watching stability |
| Resolved | resolved | Incident over, normal operation restored |
| Closed | | Channel archived, documentation complete |
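The lifecycle above can be modeled as a small state machine. This is an illustrative sketch only; Runframe enforces its own transition rules, and the names below simply mirror the status column of the table.

```python
# Illustrative model of the incident lifecycle; not Runframe's actual logic.
# "Detection" and "Closed" have no slash-command status, so they are omitted.
ALLOWED_TRANSITIONS = {
    "investigating": {"identified", "monitoring", "resolved"},
    "identified": {"investigating", "monitoring", "resolved"},
    "monitoring": {"investigating", "resolved"},
    "resolved": set(),  # closing and archiving happen outside status moves
}

def can_move(current: str, target: str) -> bool:
    """Return True if moving from `current` to `target` is a valid step."""
    return target in ALLOWED_TRANSITIONS.get(current, set())
```

Note that backward moves (for example, monitoring back to investigating when a fix doesn’t hold) are allowed in this sketch, since incidents rarely progress strictly forward.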

Status changes

Update incident status with Slack commands or the web dashboard:

/inc move investigating
/inc move identified
/inc move monitoring
/inc move resolved

Each status update:

  • Posts to the incident channel
  • Updates the dashboard incident detail page
  • Records timestamps for MTTA (Mean Time to Acknowledge) and MTTR (Mean Time to Resolve) analytics

Use /inc resolve for the complete workflow

/inc move resolved updates the status but doesn’t complete the incident workflow. Use /inc resolve to close the incident properly and generate postmortem prompts.


Severity levels

Runframe uses five severity levels to classify incidents. Severity determines:

  • SLA deadlines for acknowledgment and resolution
  • Escalation timing and paths
  • Who gets notified and how (Slack DM, SMS, phone call)
  • Whether a postmortem is required

Severity definitions

| Name | Also Known As | Definition | Typical SLA (ack / resolve) | Postmortem |
| --- | --- | --- | --- | --- |
| Critical (SEV0) | P0 | Complete service outage or data loss | 5 min / 30 min | Required |
| High (SEV1) | P1 | Major feature broken, significant user impact | 15 min / 1 hour | Required |
| Medium (SEV2) | P2 | Degraded performance, partial user impact | 30 min / 4 hours | Recommended |
| Low (SEV3) | P3 | Minor issue, workaround exists | 1 hour / 1 day | Optional |
| Pre-emptive (SEV4) | P4 | Cosmetic issue, no user impact | 4 hours / 3 days | Skip |
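The SLA column above can be expressed as a simple lookup. This sketch is illustrative only (the mapping name and structure are not Runframe’s API); the values come straight from the table:

```python
from datetime import datetime, timedelta

# SLA targets from the severity table: (acknowledgment, resolution).
# Illustrative data structure, not a Runframe configuration format.
SLA_TARGETS = {
    "SEV0": (timedelta(minutes=5), timedelta(minutes=30)),
    "SEV1": (timedelta(minutes=15), timedelta(hours=1)),
    "SEV2": (timedelta(minutes=30), timedelta(hours=4)),
    "SEV3": (timedelta(hours=1), timedelta(days=1)),
    "SEV4": (timedelta(hours=4), timedelta(days=3)),
}

def deadlines(created_at: datetime, severity: str):
    """Compute the acknowledgment and resolution deadlines for an incident."""
    ack, res = SLA_TARGETS[severity]
    return created_at + ack, created_at + res
```

Re-running this lookup after a severity change is also how deadline recalculation works conceptually: new severity in, new deadlines out.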

Choosing severity

Avoid severity fatigue

Reserve Critical and High for true emergencies. Overusing high severities leads to alert fatigue and slower response when it really matters.

Critical (SEV0 / P0)

  • Complete service outage
  • Data loss or corruption
  • Security breach
  • Complete loss of authentication

High (SEV1 / P1)

  • Major feature completely broken
  • Significant performance degradation affecting all users
  • Inability to process payments or critical transactions
  • No workaround available

Medium (SEV2 / P2)

  • Degraded performance for some users
  • Feature partially broken but workaround exists
  • Increased error rates but core functionality works
  • Single region or service impact

Low (SEV3 / P3)

  • Minor bug affecting few users
  • Performance impact that doesn’t block workflows
  • Admin or internal tool broken
  • Clear workaround available

Pre-emptive (SEV4 / P4)

  • Typo or cosmetic issue
  • Documentation error
  • Feature request disguised as an incident
  • No functional impact

Changing severity

As you learn more, adjust the severity using friendly names:

/inc severity critical
/inc severity high
/inc severity medium
/inc severity low
/inc severity pre-emptive

Or use SEV notation:

/inc severity SEV0
/inc severity SEV1
/inc severity SEV2
/inc severity SEV3
/inc severity SEV4

Severity changes update SLAs

When you change severity, Runframe recalculates acknowledgment and resolution deadlines. Escalation policies may be re-evaluated.


SLAs and escalation

Runframe automatically tracks SLA deadlines for every incident based on severity.

SLA deadlines

| SLA Type | Definition |
| --- | --- |
| Acknowledgment | Time from incident creation to first responder assignment |
| Resolution | Time from incident creation to resolved status |

SLA countdown

Every incident channel displays a live SLA countdown:

  • Green: On track, more than 50% of the SLA remaining
  • Yellow: Approaching deadline, 50% or less of the SLA remaining
  • Red: SLA breached or breach imminent
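The color logic can be sketched as a small function. This is an assumption-laden illustration: the exact threshold Runframe uses for “imminent” isn’t documented, so this version treats only an actual breach as red.

```python
def sla_color(elapsed: float, sla: float) -> str:
    """Map SLA progress to the countdown colors described above.

    `elapsed` and `sla` are in the same unit (e.g. seconds).
    Illustrative only; Runframe's exact thresholds may differ.
    """
    if elapsed >= sla:
        return "red"  # SLA breached
    remaining = sla - elapsed
    return "green" if remaining > sla / 2 else "yellow"
```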

Automatic escalation

When SLAs are breached, Runframe triggers escalation policies:

  1. First escalation - Notify incident lead or manager
  2. Second escalation - Page backup on-call or escalate to wider team
  3. Executive escalation - For Critical/High incidents exceeding resolution SLA

Configure escalation policies in Settings, or see /guides/escalations.
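The three-step ladder above can be sketched as an ordered list of targets. The target names here are placeholders for illustration, not Runframe configuration keys:

```python
# Illustrative escalation ladder mirroring the three steps above.
ESCALATION_LADDER = [
    "incident lead or manager",         # 1. first escalation
    "backup on-call / wider team",      # 2. second escalation
    "executives (Critical/High only)",  # 3. executive escalation
]

def escalation_target(breach_count: int):
    """Return who to notify after the nth SLA breach (1-based), or None."""
    if 1 <= breach_count <= len(ESCALATION_LADDER):
        return ESCALATION_LADDER[breach_count - 1]
    return None
```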


Incident roles

For larger incidents, assign specific roles to clarify responsibilities:

/inc assign @username --role lead
/inc assign @username --role comms
/inc assign @username --role operations

| Role | Responsibility | Skills |
| --- | --- | --- |
| Incident Lead | Overall coordination, status updates, escalation decisions | Communication, decisiveness |
| Comms Lead | Customer notifications, stakeholder updates, public statements | Clear writing, customer empathy |
| Operations Lead | Technical investigation, fix deployment, verification | Technical depth, system knowledge |

Role assignment is optional

For smaller incidents, a single responder often handles all responsibilities. Use roles for complex or high-severity incidents.


Customer impact

Track whether an incident affects customers:

  • Customer-impacting incidents get higher visibility in dashboards
  • SLAs may differ for customer-impacting vs. internal incidents
  • Communication templates differ based on customer impact

Mark incidents as customer-impacting when creating or editing from the dashboard.


Related incidents

Link related incidents to track duplicates, dependencies, or recurring issues:

| Relationship | When to use |
| --- | --- |
| Duplicate | Multiple reports of the same issue |
| Depends on | This incident is blocked by another |
| Blocks | This incident prevents another from being resolved |
| Related to | Loose connection, worth noting |
| Caused by | This incident is a consequence of another |
| Root cause of | This incident led to another |

Link related incidents from the dashboard incident detail page.


Communication best practices

Good incident communication reduces stress and builds trust.

During the incident

Do:

  • Update regularly – Even “still investigating” manages expectations
  • Be transparent – Share what you know and what you don’t
  • Provide ETAs – Even if it’s “no ETA yet, next update in 30 minutes”
  • Use threads for side discussions – Keep the main channel focused on status
  • Escalate early if stuck – Use /inc page or escalate manually

Don’t:

  • Don’t speculate – Only report confirmed information
  • Don’t blame – Focus on the system, not individuals
  • Don’t hide bad news – Earlier is better
  • Don’t go silent – Silence causes more anxiety than bad news

Status update template

/inc move identified

Root cause: Database connection pool exhaustion causing 500 errors.
Impact: Checkout flows failing with 500 errors.
Fix: Increasing connection pool size and deploying patch.
ETA: 15 minutes to deployment, 30 minutes to verify stability.

Resolution summary template

When resolving incidents, provide context for postmortems:

Fixed database connection pool leak. Deployed patch v2.3.1.
Connection pool size increased from 10 to 50. Added monitoring for pool exhaustion.
Error rates back to normal. Monitoring for 24 hours.

Analytics and insights

Runframe tracks incident metrics to help you improve:

MTTA and MTTR

| Metric | Full Name | Definition |
| --- | --- | --- |
| MTTA | Mean Time to Acknowledge | Average time from incident creation to first responder assignment |
| MTTR | Mean Time to Resolve | Average time from incident creation to resolved status |

View trends in the Analytics section of the dashboard.
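Both metrics are the same computation over different event pairs. A minimal sketch (Runframe records these automatically; the function name is our own):

```python
from datetime import timedelta

def mean_time(pairs):
    """Average the gap between (created_at, event_at) timestamp pairs.

    Pass (created, acknowledged) pairs for MTTA, or (created, resolved)
    pairs for MTTR. Illustrative only; Runframe tracks these for you.
    """
    gaps = [event - created for created, event in pairs]
    return sum(gaps, timedelta()) / len(gaps)
```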

Incident frequency

Track incidents over time by severity, service, or team to identify patterns and systemic issues.

