AI Incident Response & Monitoring | Framework | Practitioner

AI Incident Post-Mortem: Templates and Best Practices

November 25, 2025 | 9 min read | Michael Lansdowne Hauge
For: IT Leaders, Risk Managers, AI Project Managers, Operations Directors

How to conduct effective AI incident post-mortems that drive improvement. Includes facilitation guide, templates, and strategies for blameless learning.

Key Takeaways

  1. Conduct thorough post-incident analysis to prevent recurrence
  2. Structure post-mortems to extract actionable learnings
  3. Create a blameless culture that encourages honest incident review
  4. Document root causes and contributing factors systematically
  5. Turn incident insights into improved processes and controls

The investigation is complete. The root cause is understood. Now what?

A post-mortem review is where incident response transforms into organizational learning. Done well, it prevents recurrence and improves capability. Done poorly—or skipped entirely—it guarantees you'll face the same problems again.

This guide provides a practical framework for conducting AI incident post-mortems that drive genuine improvement.


Executive Summary

  • Post-mortems are for learning, not blame: Create psychological safety for honest discussion
  • Structure enables consistency: Use a standard format so post-mortems are comparable and complete
  • Action items must be tracked: Insights without implementation are worthless
  • Timing matters: Too soon and emotions interfere; too late and memory fades
  • Include the right people: Those who responded, those who will implement the preventive changes, and those who can authorize them
  • Share learnings: Organizational learning requires dissemination beyond the immediate team
  • Follow through: Post-mortems are only valuable if recommendations are implemented

Why This Matters Now

Most organizations skip or shortcut post-mortems. The incident is over, everyone's tired, and there's pressure to move on. This is a mistake.

Without effective post-mortems:

  • The same incidents keep happening
  • Teams don't learn from each other's experiences
  • Root causes go unaddressed
  • Confidence in AI systems erodes

Organizations that invest in post-mortems build resilience. They don't just respond to incidents—they prevent them.


When to Conduct a Post-Mortem

Always conduct post-mortems for:

  • Critical severity incidents
  • High severity incidents
  • Any incident requiring regulatory notification
  • Incidents with external impact
  • Incidents revealing systemic issues
  • Novel incident types

Consider post-mortems for:

  • Medium severity incidents
  • Near-misses that could have been serious
  • Incidents with valuable learning potential
  • Incidents that team members ask to have reviewed

May skip formal post-mortems for:

  • Low severity, routine incidents with known causes
  • Incidents already covered by recent similar post-mortems
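
The criteria above can be encoded as a small triage function so that the decision is applied consistently. This is a minimal sketch, not a definitive policy: the `Incident` fields, the `Severity` enum, and the returned labels are all hypothetical names, and the precedence (a "required" trigger always wins; a recent similar post-mortem outranks a "consider" trigger) is an assumption the lists above do not spell out.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


@dataclass
class Incident:
    # Hypothetical record; field names are illustrative only.
    severity: Severity
    regulatory_notification: bool = False
    external_impact: bool = False
    systemic_issue: bool = False
    novel_type: bool = False
    near_miss: bool = False
    learning_potential: bool = False
    team_requested: bool = False
    covered_by_recent_postmortem: bool = False


def postmortem_decision(incident: Incident) -> str:
    """Return 'required', 'consider', or 'may_skip'."""
    # "Always conduct" triggers take priority over everything else.
    if (
        incident.severity in (Severity.CRITICAL, Severity.HIGH)
        or incident.regulatory_notification
        or incident.external_impact
        or incident.systemic_issue
        or incident.novel_type
    ):
        return "required"
    # Assumption: a recent similar post-mortem outranks "consider" triggers.
    if incident.covered_by_recent_postmortem:
        return "may_skip"
    if (
        incident.severity is Severity.MEDIUM
        or incident.near_miss
        or incident.learning_potential
        or incident.team_requested
    ):
        return "consider"
    # Remaining case: low severity, routine, known cause.
    return "may_skip"
```

For example, `postmortem_decision(Incident(Severity.LOW, regulatory_notification=True))` returns `"required"`, because a regulatory notification triggers a post-mortem regardless of severity.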

Post-Mortem Process

Step 1: Schedule and Prepare

Timing: 3-7 days after incident closure

  • Soon enough that memory is fresh
  • Late enough that emotions have settled

Attendees:

  • Incident response team members
  • System owners/operators
  • Relevant technical experts
  • Management stakeholder (to authorize changes)
  • Facilitator (ideally someone who was not involved in the incident)

Preparation:

  • Distribute investigation report in advance
  • Gather timeline and key facts
  • Send pre-read to all participants
  • Reserve 90-120 minutes
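
The 3-7 day timing rule can be turned into a concrete scheduling window. The sketch below assumes calendar days (the guidance above does not say whether business days are meant), and the function names are illustrative.

```python
from datetime import date, timedelta


def postmortem_window(closed_on: date, min_days: int = 3, max_days: int = 7) -> tuple[date, date]:
    """Return the earliest and latest session dates after incident closure."""
    # Assumption: calendar days, counted from the closure date.
    return closed_on + timedelta(days=min_days), closed_on + timedelta(days=max_days)


def is_within_window(session_date: date, closed_on: date) -> bool:
    """Check whether a proposed session date falls inside the window."""
    earliest, latest = postmortem_window(closed_on)
    return earliest <= session_date <= latest
```

A scheduling tool could call `postmortem_window` when an incident is closed and propose the first date in the window on which all attendees are free.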

Step 2: Facilitate the Session

Ground Rules (Facilitator sets these at the start)

  1. Blameless: We're here to improve systems, not assign blame
  2. Assume good intentions: Everyone did what made sense to them at the time
  3. Focus on facts: What happened, not what we imagine happened
  4. Seek to understand: Ask questions before judging
  5. All perspectives welcome: Everyone's view adds value
  6. Constructive only: Criticize problems, not people

Agenda:

| Time | Topic | Purpose |
|------|-------|---------|
| 10 min | Context setting | Review incident summary, set ground rules |
| 20 min | Timeline review | Walk through what happened |
| 20 min | What went well | Identify what worked in response |
| 30 min | What went wrong | Identify failures and contributing factors |
| 20 min | Improvement actions | Develop specific, actionable improvements |
| 10 min | Wrap-up | Confirm action items, assign owners |

Step 3: Document Findings

Post-Mortem Report Template

AI INCIDENT POST-MORTEM

Incident ID: [ID]
Date of Incident: [Date]
Date of Post-Mortem: [Date]
Facilitator: [Name]
Attendees: [Names and roles]

INCIDENT SUMMARY
[2-3 paragraph summary of what happened]

TIMELINE
[Key events with timestamps]
- [Time]: [Event]
- [Time]: [Event]

IMPACT
- Users affected: [Number]
- Duration: [Time]
- Business impact: [Description]
- Other impacts: [Description]

ROOT CAUSES
1. [Primary root cause]
   - Contributing factor: [Detail]
   - Contributing factor: [Detail]

2. [Secondary root cause if applicable]
   - Contributing factor: [Detail]

WHAT WENT WELL
- [Item 1]
- [Item 2]
- [Item 3]

WHAT WENT WRONG
- [Item 1]: [Description and impact]
- [Item 2]: [Description and impact]
- [Item 3]: [Description and impact]

WHERE WE GOT LUCKY
[Things that could have made this worse but didn't]
- [Item 1]

LESSONS LEARNED
1. [Lesson 1]
2. [Lesson 2]
3. [Lesson 3]

ACTION ITEMS
| # | Action | Owner | Due Date | Status |
|---|--------|-------|----------|--------|
| 1 | [Specific action] | [Name] | [Date] | Open |
| 2 | [Specific action] | [Name] | [Date] | Open |
| 3 | [Specific action] | [Name] | [Date] | Open |

METRICS TO TRACK
- [Metric that would indicate improvement]
- [Metric that would indicate recurrence]

FOLLOW-UP
- Next review date: [Date]
- Distribution: [Who receives this document]
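
If action items are kept as structured data rather than typed by hand, the ACTION ITEMS table in the template can be generated from them. The following is a minimal sketch; the dictionary keys (`action`, `owner`, `due`, `status`) are assumed names, not part of the template.

```python
def render_action_table(items: list[dict]) -> str:
    """Render action items in the pipe-table layout used by the template."""
    lines = [
        "| # | Action | Owner | Due Date | Status |",
        "|---|--------|-------|----------|--------|",
    ]
    for number, item in enumerate(items, start=1):
        # Items without an explicit status default to "Open", as in the template.
        status = item.get("status", "Open")
        lines.append(
            f"| {number} | {item['action']} | {item['owner']} | {item['due']} | {status} |"
        )
    return "\n".join(lines)
```

Generating the table from one source keeps the report and whatever tool tracks the actions from drifting apart.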

Step 4: Track Action Items

Good action items are:

  • Specific: Clear what needs to be done
  • Assigned: One owner (not "the team")
  • Time-bound: Due date specified
  • Measurable: Can verify completion

Bad action items:

  • "Be more careful"
  • "Improve monitoring"
  • "Train people better"

Good action items:

  • "Implement alerting for model accuracy below 85% by [date] - Owner: [Name]"
  • "Add input validation for [specific field] by [date] - Owner: [Name]"
  • "Update runbook section 3.4 to include [procedure] by [date] - Owner: [Name]"
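
The four qualities of a good action item can be checked mechanically before an item is accepted. This sketch is a rough illustration under stated assumptions: the `ActionItem` fields are hypothetical, and the specificity test (a short list of vague phrases plus a minimum word count) is a crude heuristic that does not replace a facilitator's judgment.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Assumed examples of wording that is too vague to act on.
VAGUE_PHRASES = ("be more careful", "improve monitoring", "train people better")
# Assumed placeholders that do not name one accountable person.
GROUP_OWNERS = ("the team", "team", "everyone", "all")


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None
    done_when: Optional[str] = None  # how completion will be verified


def action_item_problems(item: ActionItem) -> list[str]:
    """Return the reasons an action item fails the four criteria (empty if none)."""
    problems = []
    text = item.description.strip().lower().rstrip(".")
    # Heuristic only: known vague phrases or very short descriptions.
    if text in VAGUE_PHRASES or len(text.split()) < 4:
        problems.append("not specific")
    if not item.owner or item.owner.strip().lower() in GROUP_OWNERS:
        problems.append("no single owner")
    if item.due is None:
        problems.append("no due date")
    if not item.done_when:
        problems.append("no completion check")
    return problems
```

An item such as `ActionItem("Improve monitoring", owner="the team")` fails all four checks, while an item with a concrete description, a named owner, a due date, and a verification step passes with an empty list.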

Tracking:

  • Review action items weekly until complete
  • Report on completion to management
  • Don't close the post-mortem until all critical actions are done
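
The weekly review and the closure rule can be supported by two small checks: which open items are overdue, and whether every critical item is done. This is a sketch with assumed names (`TrackedAction`, `overdue`, `can_close_postmortem`); how an item is marked critical is left to the team.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class TrackedAction:
    # Hypothetical tracking record; field names are illustrative.
    action: str
    owner: str
    due: date
    critical: bool = False
    done: bool = False


def overdue(actions: list[TrackedAction], today: date) -> list[TrackedAction]:
    """Open items whose due date has passed; candidates for escalation."""
    return [a for a in actions if not a.done and a.due < today]


def can_close_postmortem(actions: list[TrackedAction]) -> bool:
    """The post-mortem may close only when every critical action is done."""
    return all(a.done for a in actions if a.critical)
```

Non-critical items do not block closure in this sketch, but they still appear in the `overdue` list until they are completed.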

Step 5: Share Learnings

Internal sharing:

  • Post to incident learning repository
  • Brief relevant teams
  • Include in regular safety/quality meetings
  • Update training materials

Consider sharing:

  • Across departments facing similar AI risks
  • At organizational learning forums
  • Anonymized sharing in industry groups (if appropriate)

Post-Mortem Templates

Template 1: Standard Post-Mortem (Medium/High Severity)

[Use the Post-Mortem Report Template from Step 3 above]

Template 2: Brief Post-Mortem (Low Severity / Near-Miss)

BRIEF POST-MORTEM

Incident: [One-line description]
Date: [Date]
Severity: [Level]
Reviewed by: [Names]

WHAT HAPPENED
[2-3 sentences]

ROOT CAUSE
[One sentence]

KEY LEARNING
[One sentence]

ACTION ITEM
| Action | Owner | Due |
|--------|-------|-----|
| [Action] | [Name] | [Date] |

No detailed post-mortem required because: [Reason]

Template 3: Major Incident Post-Mortem (Critical Severity)

[Use standard template with additions:]

ADDITIONAL SECTIONS FOR MAJOR INCIDENTS

EXTERNAL COMMUNICATION REVIEW
- What we communicated: [Summary]
- What worked: [Items]
- What could improve: [Items]

REGULATORY INTERACTION REVIEW
- Notifications made: [List]
- Regulator response: [Summary]
- Lessons for future notifications: [Items]

RECOVERY ASSESSMENT
- Recovery time: [Duration]
- Recovery completeness: [Assessment]
- Recovery gaps: [Items]

COST ANALYSIS
- Direct costs: [Amount]
- Indirect costs: [Amount]
- Opportunity costs: [Amount]
- Potential avoided costs (if improvements made): [Amount]

EXECUTIVE SUMMARY
[One-page summary suitable for board/executive distribution]

Discussion Questions for Post-Mortems

On Detection

  • How was the incident detected?
  • Could we have detected it earlier?
  • What monitoring was in place? What was missing?
  • Did alerts fire? Were they actionable?

On Response

  • Did we have the right people engaged?
  • Was escalation appropriate?
  • Were procedures followed? Were they adequate?
  • What slowed us down?

On Containment

  • How quickly was containment achieved?
  • Was the containment approach appropriate?
  • What tools or access did we lack?
  • Could we contain faster next time?

On Communication

  • Did the right people know what was happening?
  • Was communication timely and clear?
  • Were stakeholders appropriately informed?
  • What communication gaps existed?

On Root Cause

  • Have we found the true root cause or just symptoms?
  • Why did our controls fail to prevent this?
  • What systemic issues contributed?
  • Have we seen similar issues before?

On Prevention

  • What would have prevented this incident?
  • What changes would reduce similar risks?
  • Are we treating symptoms or causes?
  • How do we know our fixes will work?

Common Failure Modes

1. Blame Culture

People don't speak honestly because they fear consequences. Solution: Explicit blameless principles, leadership modeling, separating post-mortems from performance evaluation.

2. Surface Analysis

Stopping at the obvious cause without digging deeper. Solution: Use structured root cause techniques, ask "why" repeatedly.

3. Action Item Graveyard

Items identified but never implemented. Solution: Track completion, escalate delays, tie to regular work planning.

4. Wrong Attendees

Missing key perspectives or including too many people. Solution: Thoughtful attendee selection, keep groups focused.

5. Rushed Sessions

Not allowing enough time for thorough discussion. Solution: Protect the full 90-120 minutes and don't cut the session short.

6. No Follow-Through

Post-mortem happens but learnings aren't disseminated. Solution: Required sharing, learning repositories, training updates.

7. Skipping Post-Mortems

Pressure to move on. Solution: Make post-mortems mandatory for qualifying incidents, schedule them automatically.


Implementation Checklist

Building the Capability

  • Define post-mortem criteria (when required)
  • Create templates and procedures
  • Train facilitators
  • Establish action tracking mechanism
  • Create learning sharing channels
  • Gain leadership commitment to blameless culture

For Each Post-Mortem

  • Schedule within 3-7 days of incident closure
  • Distribute pre-read materials
  • Facilitate session with ground rules
  • Document findings completely
  • Assign action items with owners and dates
  • Track action completion
  • Share learnings
  • Close when all critical actions complete

Metrics to Track

| Metric | Target | Purpose |
|--------|--------|---------|
| Post-mortem completion rate | 100% for qualifying incidents | Ensure reviews happen |
| Time to post-mortem | 3-7 days | Ensure timely review |
| Action item completion rate | >90% on time | Ensure follow-through |
| Recurrence rate | <10% within 12 months | Measure effectiveness |
| Lessons shared | 100% to relevant audiences | Ensure learning spreads |
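
Three of these metrics are simple ratios that can be computed from counts and compared with the targets. The sketch below uses hypothetical parameter names and assumes the counts come from an incident tracker; a metric with nothing to measure is reported as `None` and treated as not meeting its target.

```python
from typing import Optional


def pct(part: int, whole: int) -> Optional[float]:
    """Percentage of part in whole, or None when there is nothing to measure."""
    return None if whole == 0 else 100.0 * part / whole


def metrics_report(
    qualifying_incidents: int,
    postmortems_completed: int,
    actions_due: int,
    actions_done_on_time: int,
    incidents_reviewed: int,
    recurred_within_12_months: int,
) -> dict:
    """Map each metric to (value, meets_target) using the targets in the table."""
    completion = pct(postmortems_completed, qualifying_incidents)
    on_time = pct(actions_done_on_time, actions_due)
    recurrence = pct(recurred_within_12_months, incidents_reviewed)
    return {
        # Target: 100% for qualifying incidents.
        "post_mortem_completion": (completion, completion is not None and completion >= 100.0),
        # Target: more than 90% of action items completed on time.
        "action_on_time": (on_time, on_time is not None and on_time > 90.0),
        # Target: fewer than 10% of incidents recurring within 12 months.
        "recurrence": (recurrence, recurrence is not None and recurrence < 10.0),
    }
```

With 12 qualifying incidents all reviewed, 19 of 20 actions finished on time, and 1 recurrence among 12 incidents, all three targets are met.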

Frequently Asked Questions

How do we maintain a blameless culture?

Leadership must model it. Separate post-mortems from performance evaluation. Focus language on systems and processes, not individuals. Call out blame when it occurs. Celebrate honest disclosure.

What if we don't know the root cause?

Document the uncertainty. "We believe X, but cannot confirm" is acceptable. Implement actions based on your best understanding, and schedule a follow-up if more investigation is needed.

How do we handle repeat incidents?

If an incident recurs, the post-mortem should examine why previous actions didn't prevent it. Were actions completed? Were they insufficient? Were there new contributing factors?

Should post-mortems be public within the organization?

Yes, with appropriate sensitivity. Sharing builds organizational learning. Redact truly sensitive details if needed, but default to transparency.

Who should facilitate post-mortems?

Someone not directly involved in the incident if possible. The facilitator needs to maintain neutrality and focus on learning rather than defending actions.

How long should a post-mortem session take?

90-120 minutes for medium/high severity incidents. Brief reviews (30-45 minutes) for minor incidents. Major incidents may need multiple sessions.

What if people disagree about root cause?

Capture different perspectives in the document. If disagreement is significant, it may indicate insufficient investigation. Additional analysis may be needed.


Taking Action

Post-mortems are where incidents become learning. The organizations that improve their AI systems fastest are those that treat every incident as an opportunity to get better.

Don't skip post-mortems. Don't rush them. And most importantly—don't let action items die in a document. Follow through until improvements are real.

Ready to build effective AI incident learning processes?

Pertama Partners helps organizations establish post-mortem practices that drive genuine improvement. Our AI Readiness Audit includes incident response and continuous improvement assessment.

Book an AI Readiness Audit →


References

  1. Google SRE. (2024). Postmortem Culture: Learning from Failure.
  2. Etsy. (2024). Debriefing Facilitation Guide.
  3. Dekker, S. (2014). The Field Guide to Understanding 'Human Error'.
  4. Reason, J. (1997). Managing the Risks of Organizational Accidents.
  5. Hollnagel, E. (2014). Safety-I and Safety-II: The Past and Future of Safety Management.

Michael Lansdowne Hauge

Founder & Managing Partner at Pertama Partners. Founder of Pertama Group.
