
Incidents

Learn how Runframe manages incidents from detection to postmortem.


Overview

An incident is an unplanned disruption or degradation of service that affects your users. Runframe provides a complete incident management system that helps your team respond quickly, communicate effectively, and learn from every incident.

What you get with every incident

| Feature | Benefit |
| --- | --- |
| Dedicated Slack channel | Focused coordination without noise |
| Automatic assignments | Right people notified immediately |
| SLA tracking | Clear deadlines for acknowledgment and resolution |
| Status updates | Everyone stays informed automatically |
| Escalation safeguards | Incidents never fall through the cracks |
| Postmortem workflow | Capture learnings while they’re fresh |

Incident lifecycle

Every incident in Runframe follows a structured lifecycle:

| Stage | Status | What happens |
| --- | --- | --- |
| Detection | | Incident identified, pager triggered |
| Acknowledged | investigating | Responder assigned, investigating root cause |
| Identified | identified | Root cause known, working on fix |
| Monitoring | monitoring | Fix deployed, watching stability |
| Resolved | resolved | Incident over, normal operation restored |
| Closed | | Channel archived, documentation complete |

Status changes

Update incident status with Slack commands or the web dashboard:

/inc status investigating
/inc status identified
/inc status monitoring
/inc status resolved

Each status update:

  • Posts to the incident channel
  • Updates the dashboard incident detail page
  • Records timestamps for MTTA (Mean Time to Acknowledge) and MTTR (Mean Time to Resolve) analytics
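As a rough illustration of the timestamp bookkeeping described above, the sketch below records the first time an incident reaches each status. The `IncidentTimeline` class and its methods are hypothetical names for this example, not part of Runframe's API, and keeping only the first timestamp per status is an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional

# Status values taken from the /inc status commands above.
STATUSES = ("investigating", "identified", "monitoring", "resolved")


@dataclass
class IncidentTimeline:
    """Hypothetical record of when an incident first reached each status."""

    created_at: datetime
    reached: Dict[str, datetime] = field(default_factory=dict)

    def set_status(self, status: str, at: datetime) -> None:
        if status not in STATUSES:
            raise ValueError(f"unknown status: {status!r}")
        # Assumption: only the first time a status is reached is kept.
        self.reached.setdefault(status, at)

    def time_to(self, status: str) -> Optional[timedelta]:
        """Elapsed time from creation to the given status, or None if not reached."""
        at = self.reached.get(status)
        return None if at is None else at - self.created_at


start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
timeline = IncidentTimeline(created_at=start)
timeline.set_status("investigating", start + timedelta(minutes=4))
timeline.set_status("resolved", start + timedelta(minutes=52))
print(timeline.time_to("investigating"))  # 0:04:00
print(timeline.time_to("resolved"))       # 0:52:00
```

Durations like these are the per-incident inputs that MTTA and MTTR averages are built from.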

Use /inc resolve to complete the workflow

/inc status resolved updates the status but doesn’t complete the incident workflow. Use /inc resolve to close the incident and generate postmortem prompts.


Severity levels

Runframe uses five severity levels to classify incidents. Severity determines:

  • SLA deadlines for acknowledgment and resolution
  • Escalation timing and paths
  • Who gets notified and how (Slack DM, SMS, phone call)
  • Whether a postmortem is required

Severity definitions

| Severity | Name | Definition | Typical SLA (acknowledgment / resolution) | Postmortem |
| --- | --- | --- | --- | --- |
| P0 | Critical | Complete service outage or data loss | 5 min / 30 min | Required |
| P1 | High | Major feature broken, significant user impact | 15 min / 1 hour | Required |
| P2 | Medium | Degraded performance, partial user impact | 30 min / 4 hours | Recommended |
| P3 | Low | Minor issue, workaround exists | 1 hour / 1 day | Optional |
| P4 | Trivial | Cosmetic issue, no user impact | 4 hours / 3 days | Skip |

Choosing severity

Avoid severity fatigue

Reserve P0 and P1 for true emergencies. Overusing high severities leads to alert fatigue and slower response when it really matters.

P0 (Critical)

  • Complete service outage
  • Data loss or corruption
  • Security breach
  • Complete loss of authentication

P1 (High)

  • Major feature completely broken
  • Significant performance degradation affecting all users
  • Inability to process payments or critical transactions
  • No workaround available

P2 (Medium)

  • Degraded performance for some users
  • Feature partially broken but workaround exists
  • Increased error rates but core functionality works
  • Single region or service impact

P3 (Low)

  • Minor bug affecting few users
  • Performance impact that doesn’t block workflows
  • Admin or internal tool broken
  • Clear workaround available

P4 (Trivial)

  • Typo or cosmetic issue
  • Documentation error
  • Feature request disguised as incident
  • No functional impact

Changing severity

Adjust severity as you learn more:

/inc severity p0
/inc severity p1
/inc severity p2
/inc severity p3
/inc severity p4

Severity changes update SLAs

When you change severity, Runframe recalculates acknowledgment and resolution deadlines. Escalation policies may be re-evaluated.
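To make the recalculation concrete, here is a minimal sketch that derives both deadlines from a severity level, using the typical SLA values from the severity table. The `TYPICAL_SLA` mapping and `sla_deadlines` function are hypothetical, and measuring deadlines from incident creation (rather than from the moment severity changes) is an assumption; Runframe's actual behavior may differ.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

# (acknowledgment SLA, resolution SLA) per severity, copied from the
# "Typical SLA" column of the severity table. Real values depend on configuration.
TYPICAL_SLA = {
    "p0": (timedelta(minutes=5), timedelta(minutes=30)),
    "p1": (timedelta(minutes=15), timedelta(hours=1)),
    "p2": (timedelta(minutes=30), timedelta(hours=4)),
    "p3": (timedelta(hours=1), timedelta(days=1)),
    "p4": (timedelta(hours=4), timedelta(days=3)),
}


def sla_deadlines(created_at: datetime, severity: str) -> Tuple[datetime, datetime]:
    """Return (ack_deadline, resolve_deadline) for a severity.

    Assumption: both deadlines are measured from incident creation.
    """
    ack, resolve = TYPICAL_SLA[severity.lower()]
    return created_at + ack, created_at + resolve


created = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
before = sla_deadlines(created, "p2")  # incident opened as P2
after = sla_deadlines(created, "p0")   # severity raised to P0
print(after[1] - created)  # 0:30:00
```

Under these assumptions, raising an incident from P2 to P0 moves the resolution deadline from 4 hours to 30 minutes after creation.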


SLAs and escalation

Runframe automatically tracks SLA deadlines for every incident based on severity.

SLA deadlines

| SLA Type | Definition |
| --- | --- |
| Acknowledgment | Time to first responder assignment |
| Resolution | Time to incident resolved status |

SLA countdown

Every incident channel displays a live SLA countdown:

  • Green: On track, more than 50% of SLA remaining
  • Yellow: Approaching deadline, less than 50% of SLA remaining
  • Red: SLA breached or imminent
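The color bands above can be sketched as a small function. This is an illustration only: the `sla_color` name, the 10% threshold for "imminent", and treating exactly 50% remaining as yellow are all assumptions, since the text does not define them.

```python
def sla_color(elapsed_seconds: float, sla_seconds: float,
              imminent_fraction: float = 0.1) -> str:
    """Classify an SLA countdown as green, yellow, or red.

    Assumptions: "imminent" means at most `imminent_fraction` of the SLA
    remains, and exactly 50% remaining counts as yellow.
    """
    if sla_seconds <= 0:
        raise ValueError("sla_seconds must be positive")
    remaining_fraction = (sla_seconds - elapsed_seconds) / sla_seconds
    if remaining_fraction <= imminent_fraction:
        return "red"      # breached (negative) or imminent
    if remaining_fraction > 0.5:
        return "green"    # on track
    return "yellow"       # approaching deadline


# A P0 acknowledgment SLA of 5 minutes (300 seconds):
print(sla_color(60, 300))   # green
print(sla_color(200, 300))  # yellow
print(sla_color(400, 300))  # red
```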

Automatic escalation

When SLAs are breached, Runframe triggers escalation policies:

  1. First escalation - Notify incident lead or manager
  2. Second escalation - Page backup on-call or escalate to wider team
  3. Executive escalation - For P0/P1 incidents exceeding resolution SLA

Configure escalation policies in Settings. For details, see /guides/escalations.


Incident roles

For larger incidents, assign specific roles to clarify responsibilities:

/inc assign @username --role lead
/inc assign @username --role comms
/inc assign @username --role operations

| Role | Responsibility | Skills |
| --- | --- | --- |
| Incident Lead | Overall coordination, status updates, escalation decisions | Communication, decisiveness |
| Comms Lead | Customer notifications, stakeholder updates, public statements | Clear writing, customer empathy |
| Operations Lead | Technical investigation, fix deployment, verification | Technical depth, system knowledge |

Role assignment is optional

For smaller incidents, a single responder often handles all responsibilities. Use roles for complex or high-severity incidents.


Customer impact

Track whether an incident affects customers:

  • Customer-impacting incidents get higher visibility in dashboards
  • SLAs may differ for customer-impacting vs. internal incidents
  • Communication templates differ based on customer impact

Mark an incident as customer-impacting when you create or edit it in the dashboard.


Related incidents

Link related incidents to track duplicates, dependencies, or recurring issues:

| Relationship | When to use |
| --- | --- |
| Duplicate | Multiple reports of the same issue |
| Depends on | This incident is blocked by another |
| Blocks | This incident prevents another from being resolved |
| Related to | Loose connection, worth noting |
| Caused by | This incident is a consequence of another |
| Root cause of | This incident led to another |

Link related incidents from the dashboard incident detail page.


Communication best practices

Good incident communication reduces stress and builds trust.

During the incident

Do:

  • Update regularly – Even “still investigating” manages expectations
  • Be transparent – Share what you know and what you don’t
  • Provide ETAs – Even if it’s “no ETA yet, next update in 30 minutes”
  • Use threads for side discussions – Keep the main channel focused on status
  • Escalate early if stuck – Use /inc page or escalate manually

Don’t:

  • Don’t speculate – Only report confirmed information
  • Don’t blame – Focus on the system, not individuals
  • Don’t hide bad news – Earlier is better
  • Don’t go silent – Silence causes more anxiety than bad news

Status update template

/inc status identified

Root cause: Database connection pool exhaustion causing 500 errors.
Impact: Checkout flows failing with 500 errors.
Fix: Increasing connection pool size and deploying patch.
ETA: 15 minutes to deployment, 30 minutes to verify stability.

Resolution summary template

When resolving incidents, provide context for postmortems:

Fixed database connection pool leak. Deployed patch v2.3.1.
Connection pool size increased from 10 to 50. Added monitoring for pool exhaustion.
Error rates back to normal. Monitoring for 24 hours.

Analytics and insights

Runframe tracks incident metrics to help you improve:

MTTA and MTTR

| Metric | Full Name | Definition |
| --- | --- | --- |
| MTTA | Mean Time to Acknowledge | Average time from incident creation to first responder assignment |
| MTTR | Mean Time to Resolve | Average time from incident creation to resolved status |

View trends in the Analytics section of the dashboard.
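To show how the two averages follow from the definitions above, here is a short sketch that computes MTTA and MTTR from per-incident timestamps. The sample data and the field names (`created`, `assigned`, `resolved`) are hypothetical; this is not Runframe's export format.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean


def mean_duration(pairs):
    """Average elapsed time over (start, end) datetime pairs."""
    seconds = [(end - start).total_seconds() for start, end in pairs]
    return timedelta(seconds=mean(seconds))


t0 = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
# Hypothetical incidents: creation, first responder assignment, resolved status.
incidents = [
    {"created": t0, "assigned": t0 + timedelta(minutes=2), "resolved": t0 + timedelta(minutes=20)},
    {"created": t0, "assigned": t0 + timedelta(minutes=4), "resolved": t0 + timedelta(minutes=40)},
    {"created": t0, "assigned": t0 + timedelta(minutes=6), "resolved": t0 + timedelta(minutes=60)},
]

mtta = mean_duration((i["created"], i["assigned"]) for i in incidents)
mttr = mean_duration((i["created"], i["resolved"]) for i in incidents)
print(mtta)  # 0:04:00
print(mttr)  # 0:40:00
```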

Incident frequency

Track incidents over time by severity, service, or team to identify patterns and systemic issues.
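If you want to run a similar breakdown on your own data, grouping comes down to counting by a chosen field. The records and field names below are illustrative assumptions, not a Runframe data format.

```python
from collections import Counter

# Hypothetical incident records; field names are illustrative only.
incident_log = [
    {"severity": "P1", "service": "checkout"},
    {"severity": "P2", "service": "checkout"},
    {"severity": "P2", "service": "search"},
    {"severity": "P3", "service": "checkout"},
]


def frequency(records, key):
    """Count incidents grouped by the given field (e.g. severity or service)."""
    return Counter(record[key] for record in records)


print(frequency(incident_log, "severity")["P2"])          # 2
print(frequency(incident_log, "service").most_common(1))  # [('checkout', 3)]
```

A service that dominates the counts across several periods is a candidate for a systemic fix rather than another one-off patch.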

