The investigation is complete. The root cause is understood. Now what?
A post-mortem review is where incident response turns into organizational learning. Done well, it prevents recurrence and improves capability. Done poorly, or skipped entirely, it leaves you likely to face the same problems again.
This guide provides a practical framework for conducting AI incident post-mortems that drive genuine improvement.
Executive Summary
- Post-mortems are for learning, not blame: Create psychological safety for honest discussion
- Structure enables consistency: Use a standard format so post-mortems are comparable and complete
- Action items must be tracked: Insights without implementation are worthless
- Timing matters: Too soon and emotions interfere; too late and memory fades
- Include the right people: Those who responded, those who will prevent, and those who can authorize changes
- Share learnings: Organizational learning requires dissemination beyond the immediate team
- Follow through: Post-mortems are only valuable if recommendations are implemented
Why This Matters Now
Most organizations skip or shortcut post-mortems. The incident is over, everyone's tired, and there's pressure to move on. This is a mistake.
Without effective post-mortems:
- The same incidents keep happening
- Teams don't learn from each other's experiences
- Root causes go unaddressed
- Confidence in AI systems erodes
Organizations that invest in post-mortems build resilience. They don't just respond to incidents; they prevent them.
When to Conduct a Post-Mortem
Always conduct post-mortems for:
- Critical severity incidents
- High severity incidents
- Any incident requiring regulatory notification
- Incidents with external impact
- Incidents revealing systemic issues
- Novel incident types
Consider post-mortems for:
- Medium severity incidents
- Near-misses that could have been serious
- Incidents with valuable learning potential
- Incidents a team member asks to have reviewed
May skip formal post-mortems for:
- Low severity, routine incidents with known causes
- Incidents already covered by recent similar post-mortems
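The three tiers above can be expressed as a small decision rule. The following is a minimal sketch, not a definitive policy: the `Incident` fields and the `postmortem_decision` function are hypothetical names, and the fall-through to "may skip" simplifies the last tier (it does not check whether the cause is known or whether a recent similar post-mortem exists).

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    REQUIRED = "required"
    CONSIDER = "consider"
    MAY_SKIP = "may skip"


@dataclass
class Incident:
    # All field names are hypothetical; map them to your own incident tracker.
    severity: str  # "critical", "high", "medium", or "low"
    regulatory_notification: bool = False
    external_impact: bool = False
    systemic_issue: bool = False
    novel_type: bool = False
    near_miss: bool = False
    high_learning_value: bool = False
    team_requested: bool = False


def postmortem_decision(incident: Incident) -> Decision:
    """Apply the tiers from the lists above, strictest tier first."""
    if (
        incident.severity in {"critical", "high"}
        or incident.regulatory_notification
        or incident.external_impact
        or incident.systemic_issue
        or incident.novel_type
    ):
        return Decision.REQUIRED
    if (
        incident.severity == "medium"
        or incident.near_miss
        or incident.high_learning_value
        or incident.team_requested
    ):
        return Decision.CONSIDER
    # Simplification: everything else is treated as low severity and routine.
    return Decision.MAY_SKIP
```

Checking the strictest tier first means a low severity incident with external impact still gets a mandatory review.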
Post-Mortem Process
Step 1: Schedule and Prepare
Timing: 3-7 days after incident closure
- Soon enough that memory is fresh
- Late enough that emotions have settled
Attendees:
- Incident response team members
- System owners/operators
- Relevant technical experts
- Management stakeholder (to authorize changes)
- Facilitator (ideally not involved in the incident)
Preparation:
- Distribute investigation report in advance
- Gather timeline and key facts
- Send pre-read to all participants
- Reserve 90-120 minutes
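The 3-7 day timing window can be computed automatically when an incident is closed. This is a sketch under one stated assumption: the window is counted in calendar days, which this guide does not specify. The function names are hypothetical.

```python
from datetime import date, timedelta


def postmortem_window(
    closed_on: date, earliest_days: int = 3, latest_days: int = 7
) -> tuple[date, date]:
    """Return the first and last dates of the scheduling window.

    Assumption: calendar days, not business days.
    """
    return (
        closed_on + timedelta(days=earliest_days),
        closed_on + timedelta(days=latest_days),
    )


def is_within_window(session_date: date, closed_on: date) -> bool:
    """True if the proposed session date falls inside the window."""
    start, end = postmortem_window(closed_on)
    return start <= session_date <= end
```

For an incident closed on 3 March 2025, `postmortem_window(date(2025, 3, 3))` gives 6 March through 10 March.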
Step 2: Facilitate the Session
Ground Rules (Facilitator sets these at the start)
- Blameless: We're here to improve systems, not assign blame
- Assume good intentions: Everyone did what made sense to them at the time
- Focus on facts: What happened, not what we imagine happened
- Seek to understand: Ask questions before judging
- All perspectives welcome: Everyone's view adds value
- Constructive only: Criticize problems, not people
Agenda:
| Time | Topic | Purpose |
|---|---|---|
| 10 min | Context setting | Review incident summary, set ground rules |
| 20 min | Timeline review | Walk through what happened |
| 20 min | What went well | Identify what worked in response |
| 30 min | What went wrong | Identify failures and contributing factors |
| 20 min | Improvement actions | Develop specific, actionable improvements |
| 10 min | Wrap-up | Confirm action items, assign owners |
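If you adapt the agenda, it helps to confirm it still fits the reserved 90-120 minutes. A minimal sketch of that check, using the time boxes from the table above (the function names are hypothetical):

```python
# Time boxes copied from the agenda table, in minutes.
AGENDA = [
    ("Context setting", 10),
    ("Timeline review", 20),
    ("What went well", 20),
    ("What went wrong", 30),
    ("Improvement actions", 20),
    ("Wrap-up", 10),
]


def agenda_total(agenda: list[tuple[str, int]]) -> int:
    """Sum the minutes allotted across all agenda topics."""
    return sum(minutes for _, minutes in agenda)


def fits_reserved_slot(
    agenda: list[tuple[str, int]], min_minutes: int = 90, max_minutes: int = 120
) -> bool:
    """True if the agenda fits the session length reserved in Step 1."""
    return min_minutes <= agenda_total(agenda) <= max_minutes


print(agenda_total(AGENDA))  # 110
```

The default agenda totals 110 minutes, which leaves a 10 minute buffer inside a 120 minute booking.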
Step 3: Document Findings
Post-Mortem Report Template
AI INCIDENT POST-MORTEM
Incident ID: [ID]
Date of Incident: [Date]
Date of Post-Mortem: [Date]
Facilitator: [Name]
Attendees: [Names and roles]
INCIDENT SUMMARY
[2-3 paragraph summary of what happened]
TIMELINE
[Key events with timestamps]
- [Time]: [Event]
- [Time]: [Event]
IMPACT
- Users affected: [Number]
- Duration: [Time]
- Business impact: [Description]
- Other impacts: [Description]
ROOT CAUSES
1. [Primary root cause]
   - Contributing factor: [Detail]
   - Contributing factor: [Detail]
2. [Secondary root cause if applicable]
   - Contributing factor: [Detail]
WHAT WENT WELL
- [Item 1]
- [Item 2]
- [Item 3]
WHAT WENT WRONG
- [Item 1]: [Description and impact]
- [Item 2]: [Description and impact]
- [Item 3]: [Description and impact]
WHERE WE GOT LUCKY
[Things that could have made this worse but didn't]
- [Item 1]
LESSONS LEARNED
1. [Lesson 1]
2. [Lesson 2]
3. [Lesson 3]
ACTION ITEMS
| # | Action | Owner | Due Date | Status |
|---|--------|-------|----------|--------|
| 1 | [Specific action] | [Name] | [Date] | Open |
| 2 | [Specific action] | [Name] | [Date] | Open |
| 3 | [Specific action] | [Name] | [Date] | Open |
METRICS TO TRACK
- [Metric that would indicate improvement]
- [Metric that would indicate recurrence]
FOLLOW-UP
- Next review date: [Date]
- Distribution: [Who receives this document]
Step 4: Track Action Items
Good action items are:
- Specific: Clear what needs to be done
- Assigned: One owner (not "the team")
- Time-bound: Due date specified
- Measurable: Can verify completion
Bad examples:
- "Be more careful"
- "Improve monitoring"
- "Train people better"
Good examples:
- "Implement alerting that fires when model accuracy drops below 85%, by [date] - Owner: [Name]"
- "Add input validation for [specific field] by [date] - Owner: [Name]"
- "Update runbook section 3.4 to include [procedure] by [date] - Owner: [Name]"
Tracking:
- Review action items weekly until complete
- Report on completion to management
- Don't close the post-mortem until all critical actions are done
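The criteria and closure rule in this step can be enforced by whatever tracker holds the action items. The sketch below is illustrative only: the `ActionItem` fields, the `VAGUE_OWNERS` set, and both functions are hypothetical. It checks the assigned, time-bound, and measurable criteria; whether an action is specific enough is left to human review.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical list of owner values that do not name one accountable person.
VAGUE_OWNERS = {"", "team", "the team", "everyone", "tbd"}


@dataclass
class ActionItem:
    description: str
    owner: str
    due: Optional[date]
    verification: str  # how completion will be confirmed
    critical: bool = False
    done: bool = False


def quality_problems(item: ActionItem) -> list[str]:
    """Return the criteria this item fails (empty list means it passes)."""
    problems = []
    if item.owner.strip().lower() in VAGUE_OWNERS:
        problems.append("not assigned to one named owner")
    if item.due is None:
        problems.append("no due date")
    if not item.verification.strip():
        problems.append("no way to verify completion")
    return problems


def can_close(items: list[ActionItem]) -> bool:
    """The post-mortem stays open until every critical action is done."""
    return all(item.done for item in items if item.critical)
```

An item such as "Be more careful", owned by "the team" with no date and no verification step, fails all three checks.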
Step 5: Share Learnings
Internal sharing:
- Post to incident learning repository
- Brief relevant teams
- Include in regular safety/quality meetings
- Update training materials
Consider sharing:
- Across departments facing similar AI risks
- At organizational learning forums
- Anonymized sharing in industry groups (if appropriate)
Post-Mortem Templates
Template 1: Standard Post-Mortem (Medium/High Severity)
[Use the Post-Mortem Report Template from Step 3]
Template 2: Brief Post-Mortem (Low Severity / Near-Miss)
BRIEF POST-MORTEM
Incident: [One-line description]
Date: [Date]
Severity: [Level]
Reviewed by: [Names]
WHAT HAPPENED
[2-3 sentences]
ROOT CAUSE
[One sentence]
KEY LEARNING
[One sentence]
ACTION ITEM
| Action | Owner | Due |
|--------|-------|-----|
| [Action] | [Name] | [Date] |
No detailed post-mortem required because: [Reason]
Template 3: Major Incident Post-Mortem (Critical Severity)
[Use standard template with additions:]
ADDITIONAL SECTIONS FOR MAJOR INCIDENTS
EXTERNAL COMMUNICATION REVIEW
- What we communicated: [Summary]
- What worked: [Items]
- What could improve: [Items]
REGULATORY INTERACTION REVIEW
- Notifications made: [List]
- Regulator response: [Summary]
- Lessons for future notifications: [Items]
RECOVERY ASSESSMENT
- Recovery time: [Duration]
- Recovery completeness: [Assessment]
- Recovery gaps: [Items]
COST ANALYSIS
- Direct costs: [Amount]
- Indirect costs: [Amount]
- Opportunity costs: [Amount]
- Potential avoided costs (if improvements made): [Amount]
EXECUTIVE SUMMARY
[One-page summary suitable for board/executive distribution]
Discussion Questions for Post-Mortems
On Detection
- How was the incident detected?
- Could we have detected it earlier?
- What monitoring was in place? What was missing?
- Did alerts fire? Were they actionable?
On Response
- Did we have the right people engaged?
- Was escalation appropriate?
- Were procedures followed? Were they adequate?
- What slowed us down?
On Containment
- How quickly was containment achieved?
- Was the containment approach appropriate?
- What tools or access did we lack?
- Could we contain faster next time?
On Communication
- Did the right people know what was happening?
- Was communication timely and clear?
- Were stakeholders appropriately informed?
- What communication gaps existed?
On Root Cause
- Have we found the true root cause or just symptoms?
- Why did our controls fail to prevent this?
- What systemic issues contributed?
- Have we seen similar issues before?
On Prevention
- What would have prevented this incident?
- What changes would reduce similar risks?
- Are we treating symptoms or causes?
- How do we know our fixes will work?
Common Failure Modes
1. Blame Culture
People don't speak honestly because they fear consequences. Solution: Explicit blameless principles, leadership modeling, separating post-mortems from performance evaluation.
2. Surface Analysis
Stopping at the obvious cause without digging deeper. Solution: Use structured root cause techniques, ask "why" repeatedly.
3. Action Item Graveyard
Items identified but never implemented. Solution: Track completion, escalate delays, tie to regular work planning.
4. Wrong Attendees
Missing key perspectives or including too many people. Solution: Thoughtful attendee selection, keep groups focused.
5. Rushed Sessions
Not allowing enough time for thorough discussion. Solution: Protect 90-120 minutes, don't shortcut.
6. Learnings Not Shared
Post-mortem happens but learnings aren't disseminated. Solution: Required sharing, learning repositories, training updates.
7. Skipping Post-Mortems
Pressure to move on. Solution: Make post-mortems mandatory for qualifying incidents, schedule them automatically.
Implementation Checklist
Building the Capability
- Define post-mortem criteria (when required)
- Create templates and procedures
- Train facilitators
- Establish action tracking mechanism
- Create learning sharing channels
- Gain leadership commitment to blameless culture
For Each Post-Mortem
- Schedule 3-7 days after incident closure
- Distribute pre-read materials
- Facilitate session with ground rules
- Document findings completely
- Assign action items with owners and dates
- Track action completion
- Share learnings
- Close when all critical actions complete
Metrics to Track
| Metric | Target | Purpose |
|---|---|---|
| Post-mortem completion rate | 100% for qualifying incidents | Ensure reviews happen |
| Time to post-mortem | 3-7 days after incident closure | Ensure timely review |
| Action item completion rate | >90% on time | Ensure follow-through |
| Recurrence rate | <10% within 12 months | Measure effectiveness |
| Lessons shared | 100% to relevant audiences | Ensure learning spreads |
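The rate-based targets in this table are simple ratios, so they can be checked from counts you already hold. This is a sketch with hypothetical key names. It assumes recurrence rate means the share of reviewed incidents that recurred within 12 months, a definition the table does not spell out.

```python
from typing import Optional


def rate(part: int, whole: int) -> Optional[float]:
    """Return part/whole as a fraction, or None when there is nothing to measure."""
    return part / whole if whole else None


def check_targets(counts: dict[str, int]) -> dict[str, Optional[bool]]:
    """Compare each rate with its target from the table (None = no data)."""
    completion = rate(counts["postmortems_completed"], counts["qualifying_incidents"])
    on_time = rate(counts["actions_done_on_time"], counts["actions_due"])
    # Assumed definition: reviewed incidents that recurred within 12 months.
    recurrence = rate(counts["incidents_recurred"], counts["incidents_reviewed"])
    return {
        "post_mortem_completion": None if completion is None else completion >= 1.0,
        "action_item_completion": None if on_time is None else on_time > 0.90,
        "recurrence": None if recurrence is None else recurrence < 0.10,
    }
```

Returning `None` rather than `True` for an empty denominator avoids reporting a target as met when no qualifying incidents occurred.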
Frequently Asked Questions
How do we maintain a blameless culture?
Leadership must model it. Separate post-mortems from performance evaluation. Focus language on systems and processes, not individuals. Call out blame when it occurs. Celebrate honest disclosure.
What if we don't know the root cause?
Document the uncertainty. "We believe X, but cannot confirm" is acceptable. Implement actions based on your best understanding. Schedule a follow-up if more investigation is needed.
How do we handle repeat incidents?
If an incident recurs, the post-mortem should examine why previous actions didn't prevent it. Were actions completed? Were they insufficient? Were there new contributing factors?
Should post-mortems be public within the organization?
Yes, with appropriate sensitivity. Sharing builds organizational learning. Redact truly sensitive details if needed, but default to transparency.
Who should facilitate post-mortems?
Someone not directly involved in the incident if possible. The facilitator needs to maintain neutrality and focus on learning rather than defending actions.
How long should a post-mortem session take?
90-120 minutes for medium/high severity incidents. Brief reviews (30-45 minutes) for minor incidents. Major incidents may need multiple sessions.
What if people disagree about root cause?
Capture different perspectives in the document. If disagreement is significant, it may indicate insufficient investigation. Additional analysis may be needed.
Taking Action
Post-mortems are where incidents become learning. The organizations that improve their AI systems fastest are those that treat every incident as an opportunity to get better.
Don't skip post-mortems. Don't rush them. And most importantly, don't let action items die in a document. Follow through until improvements are real.
Ready to build effective AI incident learning processes?
Pertama Partners helps organizations establish post-mortem practices that drive genuine improvement. Our AI Readiness Audit includes incident response and continuous improvement assessment.
References
- Google SRE. (2024). Postmortem Culture: Learning from Failure.
- Etsy. (2024). Debriefing Facilitation Guide.
- Dekker, S. (2014). The Field Guide to Understanding 'Human Error'.
- Reason, J. (1997). Managing the Risks of Organizational Accidents.
- Hollnagel, E. (2014). Safety-I and Safety-II: The Past and Future of Safety Management.

