The immediate crisis is contained. Now comes the harder work: figuring out what actually happened, why it happened, and what needs to change so it doesn't happen again.
AI incident investigation differs from traditional IT forensics: AI systems are less deterministic than conventional software, their failures are often subtle, and understanding what went wrong may require specialized expertise. This guide provides a structured methodology for investigating AI incidents thoroughly.
Executive Summary
- AI investigation has unique challenges: Non-deterministic systems, complex causation, black-box behavior
- Evidence preservation is critical: AI state, inputs, outputs, and logs must be captured before they're lost
- Root cause analysis requires AI expertise: Understanding model failures needs specialized knowledge
- Investigation scope must balance depth with speed: Don't delay remediation for perfect understanding
- Third-party AI complicates investigation: Vendor access and cooperation may be needed
- Documentation serves multiple purposes: Regulatory compliance, legal protection, organizational learning
- Investigation feeds improvement: The goal isn't blame but prevention
Why This Matters Now
Investigations often get short-changed. Once an incident is contained, there's pressure to move on. But inadequate investigation leads to:
- Recurrence: The same incident happens again because root cause wasn't addressed
- Regulatory problems: Authorities expect thorough investigation and documentation
- Legal exposure: Inadequate investigation makes defense harder if litigation occurs
- Lost learning: The organization doesn't improve its AI practices
- Hidden problems: Related issues go undetected
Thorough investigation is an investment, not a cost.
AI Investigation Challenges
Challenge 1: Non-Determinism
AI systems can produce different outputs from the same inputs. Reproducing the exact failure conditions may be impossible.
Approach: Document the statistical behavior, not just individual outputs. Look for patterns across multiple instances.
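One practical way to document statistical rather than single-shot behavior is to replay the incident-triggering input many times and record the distribution of outputs. The sketch below is illustrative only; `call_model` is a hypothetical stand-in for however your system invokes the model.

```python
from collections import Counter

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for the production inference call."""
    raise NotImplementedError("Replace with your actual model invocation")

def sample_output_distribution(prompt: str, runs: int = 50) -> Counter:
    """Replay the same input repeatedly and tally distinct outputs.

    For a non-deterministic system, this distribution (not any single
    output) is what belongs in the evidence record.
    """
    outputs = Counter()
    for _ in range(runs):
        outputs[call_model(prompt)] += 1
    return outputs

# Example usage during investigation:
# dist = sample_output_distribution(incident_prompt, runs=100)
# print(dist.most_common(5))  # how often does the failing output recur?
```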
Challenge 2: Black Box Behavior
Many AI models can't explain why they produced specific outputs. The internal reasoning is opaque.
Approach: Use explainability techniques where available. Focus on what conditions correlate with failures, even if causation is unclear.
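Where model internals are opaque, slicing failure rates by observable input attributes can at least show which conditions correlate with bad outputs. A minimal sketch, assuming you have incident-period records with an observable attribute and a reviewer-assigned failure flag (both field names are hypothetical):

```python
from collections import defaultdict

def failure_rate_by_slice(records, slice_key="input_language", flag_key="is_failure"):
    """Group records by an observable attribute and compute the failure rate per group.

    A slice with an elevated failure rate is a lead for investigation,
    not proof of causation.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for rec in records:
        key = rec.get(slice_key, "unknown")
        totals[key] += 1
        failures[key] += int(bool(rec.get(flag_key)))
    return {key: failures[key] / totals[key] for key in totals}

# Example usage:
# records = [{"input_language": "en", "is_failure": False}, ...]
# print(failure_rate_by_slice(records))
```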
Challenge 3: Complex Causation
AI failures often result from multiple interacting factors—data, model, implementation, context—not a single root cause.
Approach: Use multiple root cause analysis techniques. Accept that causation may be multifactorial.
Challenge 4: Temporal State
The model's behavior may have changed since the incident—through drift, updates, or retraining.
Approach: Preserve model state immediately. Document version information. Compare current state to incident-time state.
Challenge 5: Third-Party Systems
If the AI is vendor-provided, you may lack access to investigate internal behavior.
Approach: Engage vendors early. Contractual provisions for incident cooperation are essential. Focus on what you can observe.
AI Incident Investigation Process
Phase 1: Evidence Preservation
Objective: Secure evidence before it's lost or altered
Timing: Immediately upon incident detection, in parallel with containment
| Evidence Type | What to Preserve | How to Preserve |
|---|---|---|
| Model state | Model version, weights (if available), configuration | Snapshot, documentation |
| Input data | Inputs that triggered the incident | Copy to secure location |
| Output data | Outputs produced during incident | Export and secure |
| System logs | Application, system, security logs | Export with timestamps |
| Access logs | Who accessed what when | Export and secure |
| Configuration | System settings at time of incident | Snapshot |
| Metrics data | Performance metrics, monitoring data | Export from monitoring systems |
| Related data | Training data, feature data, context | Secure if relevant |
Evidence Chain of Custody
Document for each piece of evidence:
- What was collected
- When it was collected
- Who collected it
- Where it's stored
- Integrity verification (hashes)
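Integrity verification can be as simple as hashing each collected artifact and recording who collected it, when, and where it is stored. A minimal sketch (file paths, collector names, and the manifest location are placeholders):

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute a SHA-256 digest of an evidence file for integrity verification."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def record_evidence(path: Path, collected_by: str, manifest: Path) -> dict:
    """Append a chain-of-custody entry: what, when, who, where, and its hash."""
    entry = {
        "what": path.name,
        "where": str(path.resolve()),
        "when": datetime.now(timezone.utc).isoformat(),
        "who": collected_by,
        "sha256": sha256_of(path),
    }
    with manifest.open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

# Example usage:
# record_evidence(Path("incident_logs.tar.gz"), "j.tan", Path("custody_manifest.jsonl"))
```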
Phase 2: Initial Scoping
Objective: Define investigation boundaries
| Question | Purpose |
|---|---|
| What AI system(s) are involved? | Scope technical investigation |
| What is the incident timeline? | Focus investigation period |
| Who might have relevant information? | Plan interviews |
| What documentation exists? | Identify available evidence |
| What is the business impact? | Prioritize investigation depth |
| Are there regulatory implications? | Ensure compliance requirements met |
| Is there potential litigation? | Engage legal early if needed |
Scope Document
Investigation Scope Document
Incident ID: [ID]
Investigation Lead: [Name]
Date: [Date]
SCOPE
- Systems: [List AI systems in scope]
- Time period: [Start] to [End]
- Data: [Types of data in scope]
- People: [Roles/individuals to interview]
OUT OF SCOPE
- [Items explicitly excluded]
OBJECTIVES
1. Determine root cause of incident
2. Assess full impact
3. Identify remediation requirements
4. Document for regulatory/legal purposes
5. Extract lessons learned
CONSTRAINTS
- Investigation deadline: [Date]
- Resource constraints: [If any]
- Access limitations: [If any]
Phase 3: Information Gathering
Objective: Collect all relevant information
Technical Analysis
| Activity | Description | Output |
|---|---|---|
| Log analysis | Review system, application, and security logs | Timeline, anomalies identified |
| Model analysis | Examine model behavior, performance metrics | Model assessment |
| Data analysis | Analyze inputs, outputs, and related data | Data patterns, anomalies |
| System analysis | Review configuration, architecture, integrations | System state documentation |
| Code review | Review relevant code if applicable | Code issues identified |
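A common first step in the technical analysis is merging application, system, and security logs into a single chronological timeline. A minimal sketch, assuming each log line begins with an ISO-8601 timestamp (adapt the parsing to your actual log formats):

```python
from datetime import datetime
from pathlib import Path

def parse_line(source: str, line: str):
    """Parse 'TIMESTAMP rest-of-message' into (timestamp, source, message)."""
    ts_str, _, message = line.partition(" ")
    try:
        ts = datetime.fromisoformat(ts_str)
    except ValueError:
        return None  # skip lines that don't match the assumed format
    return ts, source, message.strip()

def merged_timeline(log_files: dict[str, Path]):
    """Combine multiple log files into one chronologically ordered timeline."""
    events = []
    for source, path in log_files.items():
        for line in path.read_text().splitlines():
            parsed = parse_line(source, line)
            if parsed:
                events.append(parsed)
    return sorted(events, key=lambda event: event[0])

# Example usage:
# timeline = merged_timeline({"app": Path("app.log"), "security": Path("audit.log")})
# for ts, source, msg in timeline:
#     print(ts.isoformat(), source, msg)
```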
Interviews
| Interviewee | Purpose | Sample Questions |
|---|---|---|
| First responders | Understand initial discovery and response | What did you observe? What actions did you take? |
| System operators | Understand normal operations and deviations | Was anything unusual before the incident? |
| AI/ML engineers | Technical understanding of the system | How should the system behave? What could cause this? |
| Business users | Business impact and context | What was the real-world effect? |
| Security team | Security context | Any related security events? |
Document Review
- System documentation
- Previous incident reports
- Change records (recent changes to the system)
- Monitoring alerts and reports
- Training data documentation
- Model validation reports
Phase 4: Root Cause Analysis
Objective: Determine what caused the incident and why
Technique 1: 5 Whys
Keep asking "why" until you reach fundamental causes:
Incident: AI chatbot provided incorrect information to customers
Why? → The model generated a response containing false facts
Why? → The model was not trained on recent policy changes
Why? → The retraining pipeline failed two months ago
Why? → Pipeline failure alerts went to a deprecated email address
Why? → Alert configuration wasn't updated during team reorganization
ROOT CAUSE: Alert configuration management process inadequate
Technique 2: Fishbone (Ishikawa) Diagram
Categorize potential causes across Data, Model, Process, People, and Systems, drawing each as a branch of the fishbone (Ishikawa) diagram that feeds into the incident.
Technique 3: Fault Tree Analysis
Work backward from the incident:
Incident (Top Event)
│
├── Immediate Cause 1
│   ├── Contributing Factor 1a
│   └── Contributing Factor 1b
│
└── Immediate Cause 2
    ├── Contributing Factor 2a
    └── Contributing Factor 2b
AI-Specific Root Cause Categories
| Category | Examples |
|---|---|
| Data issues | Data drift, poisoned data, data quality, missing data, biased data |
| Model issues | Model drift, training problems, architectural limitations, overfitting |
| Implementation issues | Integration bugs, configuration errors, deployment problems |
| Operational issues | Monitoring gaps, inadequate thresholds, response failures |
| Governance issues | Policy gaps, unapproved changes, inadequate oversight |
| External factors | Adversarial attacks, changed operating environment, third-party failures |
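Several of these categories can be checked empirically rather than argued about. For example, a quick data-drift screen compares a numeric input feature's distribution during the incident window against a pre-incident baseline. The sketch below uses a simple mean-shift heuristic; the baseline window, threshold, and feature choice are assumptions to adapt to your system:

```python
from statistics import mean, stdev

def drift_score(baseline: list[float], incident: list[float]) -> float:
    """Standardised shift in a feature's mean between two time windows.

    A large score suggests incident-period inputs differ from the baseline
    and that data drift is worth pursuing as a candidate root cause.
    """
    base_mean, base_std = mean(baseline), stdev(baseline)
    if base_std == 0:
        return float("inf") if mean(incident) != base_mean else 0.0
    return abs(mean(incident) - base_mean) / base_std

# Example usage (a threshold of 3 standard deviations is an illustrative choice):
# score = drift_score(baseline_values, incident_values)
# if score > 3:
#     print("Feature shifted significantly; investigate data drift")
```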
Phase 5: Impact Assessment
Objective: Understand full scope of incident impact
| Impact Dimension | Assessment Questions | Quantification |
|---|---|---|
| People affected | How many? Who? | Count, demographics |
| Data compromised | What types? How sensitive? | Data classification |
| Financial | Direct costs? Indirect costs? | Dollar amounts |
| Operational | Business disruption? Duration? | Downtime, affected processes |
| Reputational | Public awareness? Media? | Coverage, sentiment |
| Regulatory | Compliance violations? Notifications? | Specific requirements triggered |
| Legal | Liability exposure? | Potential claims |
Phase 6: Documentation
Objective: Create complete investigation record
Investigation Report Structure
AI INCIDENT INVESTIGATION REPORT
1. EXECUTIVE SUMMARY
- Incident overview
- Key findings
- Root causes
- Recommendations
2. INCIDENT DESCRIPTION
- Timeline
- Systems involved
- Detection method
- Initial response
3. INVESTIGATION METHODOLOGY
- Scope
- Team
- Methods used
- Limitations
4. FINDINGS
- Technical findings
- Process findings
- People findings
- Third-party findings
5. ROOT CAUSE ANALYSIS
- Primary root cause
- Contributing factors
- Analysis methodology
6. IMPACT ASSESSMENT
- Quantified impacts
- Stakeholders affected
- Regulatory implications
7. RECOMMENDATIONS
- Immediate actions
- Short-term improvements
- Long-term improvements
8. LESSONS LEARNED
- What worked well
- What didn't work
- Key takeaways
9. APPENDICES
- Evidence inventory
- Interview summaries
- Technical analysis details
- Timeline
AI Incident Investigation Checklist
Day 1 (Preservation)
- Preserve model state (version, config, weights if accessible)
- Export relevant logs
- Capture input/output data
- Document system state
- Establish chain of custody
- Identify key stakeholders
Week 1 (Core Investigation)
- Define investigation scope
- Conduct technical analysis
- Complete interviews
- Review documentation
- Begin root cause analysis
- Assess impact
Week 2+ (Analysis and Reporting)
- Complete root cause analysis
- Develop recommendations
- Draft investigation report
- Review with stakeholders
- Finalize documentation
- Transfer to post-mortem process
Common Failure Modes
1. Starting Late
Investigation starts after evidence is lost. Begin preservation immediately.
2. Too Narrow Focus
Investigating only the obvious cause while missing systemic issues. Look broadly.
3. Blame-Seeking
The investigation becomes about assigning fault rather than understanding the failure and preventing recurrence.
4. Stopping at Symptoms
Accepting surface explanations without digging to root causes.
5. Inadequate Documentation
Verbal findings that can't be referenced later. Document everything.
6. No Follow-Through
Recommendations made but never implemented. Track recommendation completion.
Metrics to Track
| Metric | Target |
|---|---|
| Investigation initiation | Within 24 hours of containment |
| Investigation completion | Within 2-4 weeks for significant incidents |
| Root causes identified | At least 1 per incident |
| Recommendations made | At least 1 per root cause |
| Recommendation implementation | >80% within 90 days |
| Recurrence rate | <10% of same incident type within 12 months |
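These metrics can be computed directly from an incident register. A minimal sketch for the recommendation-implementation metric, assuming each incident record carries a list of recommendations with the dates they were made and completed (all field names are hypothetical):

```python
from datetime import date

def recommendation_implementation_rate(incidents: list[dict], within_days: int = 90) -> float:
    """Share of recommendations implemented within the target window."""
    total, on_time = 0, 0
    for incident in incidents:
        for rec in incident.get("recommendations", []):
            total += 1
            made, done = rec.get("made_on"), rec.get("completed_on")
            if made and done and (done - made).days <= within_days:
                on_time += 1
    return on_time / total if total else 0.0

# Example usage:
# incidents = [{"recommendations": [{"made_on": date(2024, 1, 5),
#                                    "completed_on": date(2024, 2, 1)}]}]
# print(recommendation_implementation_rate(incidents))  # 1.0
```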
Frequently Asked Questions
How much investigation is enough?
Proportional to impact. Minor incidents need brief investigation; major incidents need thorough analysis. When you can explain what happened, why it happened, and what needs to change, you have investigated enough.
What if we can't determine the root cause?
Document what you do know, what remains uncertain, and why. Implement mitigations based on best understanding. Sometimes uncertainty is the answer.
Should investigation be blameless?
Focus on system improvement, not individual blame. However, if investigation reveals negligence or misconduct, that must be addressed through appropriate channels.
What if the vendor's AI caused the incident?
Engage the vendor in investigation. Your report documents what you could determine; note where vendor cooperation was needed or lacking.
How do we handle investigation while the system is still needed?
Preserve what you can without disrupting operations. Use monitoring data, logs, and configuration snapshots rather than taking systems offline if possible.
Taking Action
Thorough investigation is what separates organizations that keep having the same incidents from those that genuinely improve. The time invested in understanding what went wrong pays dividends in incidents prevented.
Build investigation capability before you need it—trained people, documented procedures, and preserved access to forensic information.
Ready to strengthen your AI incident investigation capability?
Pertama Partners helps organizations build robust AI incident investigation processes. Our AI Readiness Audit includes incident response and investigation capability assessment.
References
- NIST. (2023). Computer Security Incident Handling Guide (SP 800-61).
- ISACA. (2024). IT Incident Investigation.
- ISO/IEC 27043. Incident Investigation Principles and Processes.
- ENISA. (2024). Good Practice Guide for Incident Management.
- Reason, J. (1997). Managing the Risks of Organizational Accidents.