When high-impact incidents occur - such as website outages, data loss, or critical bugs in production - fixing the immediate issue isn't enough.
Without a structured process to capture what happened, identify the root causes, and implement improvements, the same problems can happen again. This rule defines a simple post-incident process to help teams learn from failure, prevent repeat issues, and improve software and business processes.
As soon as possible, assign someone to record the key details:
Capture these in one central place. If your monitoring system creates Product Backlog Items (PBIs) automatically, log the incident timeline and key facts as comments on that ticket.
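As a rough illustration of keeping a timeline that can be pasted into a ticket comment, here is a minimal Python sketch. The class names (`TimelineEntry`, `IncidentLog`), the fields, and the sample incident are all hypothetical and are not prescribed by this rule; adapt them to whatever details your team decides to record.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TimelineEntry:
    at: datetime   # when the event happened (UTC)
    note: str      # what was observed or done


@dataclass
class IncidentLog:
    # Hypothetical structure: the field names are assumptions, not part of the rule.
    title: str
    entries: list[TimelineEntry] = field(default_factory=list)

    def record(self, at: datetime, note: str) -> None:
        """Add one event to the timeline, in whatever order it is reported."""
        self.entries.append(TimelineEntry(at, note))

    def as_comment(self) -> str:
        """Render the timeline, oldest first, as plain text for a ticket comment."""
        lines = [f"Incident: {self.title}"]
        for entry in sorted(self.entries, key=lambda e: e.at):
            lines.append(f"{entry.at:%Y-%m-%d %H:%M} UTC - {entry.note}")
        return "\n".join(lines)


# Made-up example data, recorded out of order on purpose.
log = IncidentLog("Checkout page outage")
log.record(datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc), "Service restored")
log.record(datetime(2024, 1, 1, 9, 5, tzinfo=timezone.utc), "Alert fired")
print(log.as_comment())
```

Sorting at render time means people can add events as they remember them, and the comment still reads chronologically.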
Hold a blameless post-incident review with everyone involved. Use structured techniques like:
Tip: Don't stop at technical causes - also consider process gaps, unclear responsibilities, or communication failures.
For each contributing factor, define clear and actionable recommendations:
Each recommendation must have a dedicated PBI. The Product Owner is responsible for ensuring these PBIs are estimated, prioritised, and scheduled. Teams should review them during Sprint Planning or Backlog Refinement.
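The tracking requirement above can also be checked mechanically. The following Python sketch is one possible approach, not an established tool: the `Recommendation` class, the `untracked` function, the ticket identifiers, and the sample recommendations are all invented for illustration. It flags recommendations that have no PBI, share a PBI with another recommendation, or have a PBI that is not yet estimated, prioritised, and scheduled.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Recommendation:
    # Hypothetical record: field names are assumptions for this sketch.
    summary: str
    pbi_id: Optional[str] = None   # identifier of the dedicated PBI, if one exists
    estimated: bool = False
    prioritised: bool = False
    scheduled: bool = False


def untracked(recs: List[Recommendation]) -> List[Tuple[str, str]]:
    """Return (summary, reason) for each recommendation that is not fully tracked."""
    usage = Counter(r.pbi_id for r in recs if r.pbi_id is not None)
    problems = []
    for r in recs:
        if r.pbi_id is None:
            problems.append((r.summary, "no PBI"))
        elif usage[r.pbi_id] > 1:
            problems.append((r.summary, "PBI shared with another recommendation"))
        elif not (r.estimated and r.prioritised and r.scheduled):
            problems.append(
                (r.summary, "PBI not yet estimated, prioritised, and scheduled")
            )
    return problems


# Made-up example data.
recs = [
    Recommendation("Add disk-space alert", "PBI-1", True, True, True),
    Recommendation("Document on-call handover"),
    Recommendation("Test backup restore", "PBI-2", estimated=True),
]
for summary, reason in untracked(recs):
    print(f"{summary}: {reason}")
```

A list like this could be reviewed at Sprint Planning or Backlog Refinement so that no recommendation from the review is quietly dropped.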
A well-handled incident isn't just about restoring service - it's a chance to make meaningful improvements.
By recording incidents, analysing causes, and implementing clear actions, teams reduce risk, increase reliability, and turn failures into progress.