
AI Incident Investigation: A Step-by-Step Guide

November 24, 2025 · 10 min read · Michael Lansdowne Hauge

For: IT Leaders, Security Officers, AI Project Managers, Data Scientists

Structured methodology for investigating AI incidents. Covers evidence preservation, root cause analysis techniques, and investigation documentation requirements.


Key Takeaways

  • Follow a systematic investigation process for AI incidents
  • Gather evidence and document findings properly
  • Identify root causes versus symptoms in AI failures
  • Coordinate the investigation across technical and business teams
  • Preserve evidence for compliance and legal requirements

The immediate crisis is contained. Now comes the harder work: figuring out what actually happened, why it happened, and what needs to change so it doesn't happen again.

AI incident investigation differs from traditional IT forensics. AI systems are less deterministic, their failures are often subtle, and understanding what went wrong may require specialized expertise. This guide provides a structured methodology for investigating AI incidents thoroughly.


Executive Summary

  • AI investigation has unique challenges: Non-deterministic systems, complex causation, black-box behavior
  • Evidence preservation is critical: AI state, inputs, outputs, and logs must be captured before they're lost
  • Root cause analysis requires AI expertise: Understanding model failures needs specialized knowledge
  • Investigation scope must balance depth with speed: Don't delay remediation for perfect understanding
  • Third-party AI complicates investigation: Vendor access and cooperation may be needed
  • Documentation serves multiple purposes: Regulatory compliance, legal protection, organizational learning
  • Investigation feeds improvement: The goal isn't blame but prevention

Why This Matters Now

Investigations often get short-changed. Once an incident is contained, there's pressure to move on. But inadequate investigation leads to:

  • Recurrence: The same incident happens again because root cause wasn't addressed
  • Regulatory problems: Authorities expect thorough investigation and documentation
  • Legal exposure: Inadequate investigation makes defense harder if litigation occurs
  • Lost learning: The organization doesn't improve its AI practices
  • Hidden problems: Related issues go undetected

Thorough investigation is an investment, not a cost.


AI Investigation Challenges

Challenge 1: Non-Determinism

AI systems can produce different outputs from the same inputs. Reproducing the exact failure conditions may be impossible.

Approach: Document the statistical behavior, not just individual outputs. Look for patterns across multiple instances.
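A minimal sketch of what "document the statistical behavior" can mean in practice: replay the incident input many times and record the output distribution rather than a single output. The `flaky_model` function and failure rate here are hypothetical stand-ins for whatever system is under investigation.

```python
import random
from collections import Counter

def sample_output_distribution(model, incident_input, n_runs=100):
    """Replay one incident input many times and summarize the output
    distribution, since a single replay may not reproduce the failure."""
    outputs = [model(incident_input) for _ in range(n_runs)]
    counts = Counter(outputs)
    return {out: count / n_runs for out, count in counts.most_common()}

# Hypothetical stand-in for a non-deterministic model that fails ~20% of the time.
def flaky_model(x, _rng=random.Random(42)):
    return "wrong answer" if _rng.random() < 0.2 else "correct answer"

dist = sample_output_distribution(flaky_model, "incident input", n_runs=1000)
# dist maps each distinct output to its observed frequency across 1000 replays,
# giving an estimated failure rate instead of a single irreproducible sample.
```

The resulting frequency table is itself evidence: it shows whether the failure is rare, intermittent, or systematic under the incident conditions.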

Challenge 2: Black Box Behavior

Many AI models can't explain why they produced specific outputs. The internal reasoning is opaque.

Approach: Use explainability techniques where available. Focus on what conditions correlate with failures, even if causation is unclear.
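One way to sketch "what conditions correlate with failures": bucket incident-period records by an observable input condition and compare failure rates per bucket. The condition labels and records below are illustrative assumptions, not a real dataset.

```python
from collections import defaultdict

def failure_rate_by_condition(records):
    """Group records by an observable input condition and compute the
    failure rate per group. High-rate groups point to conditions that
    correlate with the failure, even when causation remains unclear."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for condition, failed in records:
        totals[condition] += 1
        if failed:
            failures[condition] += 1
    return {c: failures[c] / totals[c] for c in totals}

# Hypothetical records: (input condition, did the output fail?)
records = [
    ("short prompt", False), ("short prompt", False),
    ("long prompt", True), ("long prompt", True), ("long prompt", False),
]
rates = failure_rate_by_condition(records)
# rates["short prompt"] == 0.0; rates["long prompt"] == 2/3
```

This correlational view is deliberately modest: it narrows the search space for root cause analysis without claiming to explain the model's internal reasoning.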

Challenge 3: Complex Causation

AI failures often result from multiple interacting factors—data, model, implementation, context—not a single root cause.

Approach: Use multiple root cause analysis techniques. Accept that causation may be multifactorial.

Challenge 4: Temporal State

The model's behavior may have changed since the incident—through drift, updates, or retraining.

Approach: Preserve model state immediately. Document version information. Compare current state to incident-time state.
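Comparing incident-time state to current state can be as simple as a field-by-field diff of recorded version and configuration metadata. The field names and values here are hypothetical examples.

```python
def diff_states(incident_state, current_state):
    """Compare incident-time model state (version, config) against current
    state and report every field that has changed since the incident."""
    changed = {}
    for key in incident_state.keys() | current_state.keys():
        before = incident_state.get(key, "<missing>")
        after = current_state.get(key, "<missing>")
        if before != after:
            changed[key] = (before, after)
    return changed

# Hypothetical snapshots taken at incident time and at investigation time.
incident_state = {"model_version": "2.3.1", "temperature": 0.7, "top_p": 0.9}
current_state = {"model_version": "2.4.0", "temperature": 0.7, "top_p": 0.95}
changed = diff_states(incident_state, current_state)
# changed == {"model_version": ("2.3.1", "2.4.0"), "top_p": (0.9, 0.95)}
```

A non-empty diff is a warning that behavior observed today may not match behavior at incident time, and that conclusions must be drawn from the preserved snapshot.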

Challenge 5: Third-Party Systems

If the AI is vendor-provided, you may lack access to investigate internal behavior.

Approach: Engage vendors early. Contractual provisions for incident cooperation are essential. Focus on what you can observe.


AI Incident Investigation Process

Phase 1: Evidence Preservation

Objective: Secure evidence before it's lost or altered

Timing: Immediately upon incident detection, parallel to containment

Evidence Type | What to Preserve | How to Preserve
Model state | Model version, weights (if available), configuration | Snapshot, documentation
Input data | Inputs that triggered the incident | Copy to secure location
Output data | Outputs produced during incident | Export and secure
System logs | Application, system, security logs | Export with timestamps
Access logs | Who accessed what, when | Export and secure
Configuration | System settings at time of incident | Snapshot
Metrics data | Performance metrics, monitoring data | Export from monitoring systems
Related data | Training data, feature data, context | Secure if relevant

Evidence Chain of Custody

Document for each piece of evidence:

  • What was collected
  • When it was collected
  • Who collected it
  • Where it's stored
  • Integrity verification (hashes)
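A minimal sketch of the custody steps above: copy each evidence file into a vault directory and append a chain-of-custody record (what, when, who, where, SHA-256 hash) to a manifest. Paths, the collector name, and the manifest format are illustrative assumptions.

```python
import hashlib
import json
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def preserve_evidence(source, vault_dir, collector):
    """Copy an evidence file into a secure vault directory and append a
    chain-of-custody entry: what was collected, when, by whom, where it
    is stored, and a SHA-256 hash for later integrity verification."""
    source = Path(source)
    vault = Path(vault_dir)
    vault.mkdir(parents=True, exist_ok=True)
    dest = vault / source.name
    shutil.copy2(source, dest)  # copy2 preserves file timestamps
    entry = {
        "what": source.name,
        "when": datetime.now(timezone.utc).isoformat(),
        "who": collector,
        "where": str(dest),
        "sha256": hashlib.sha256(dest.read_bytes()).hexdigest(),
    }
    with (vault / "custody_manifest.jsonl").open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

# Hypothetical usage with a throwaway log file.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "app.log"
    log.write_text("2025-11-24T10:00:00Z model returned anomalous output\n")
    entry = preserve_evidence(log, Path(tmp) / "vault", collector="jsmith")
```

Re-hashing the stored file later and comparing against the manifest entry is what makes integrity verifiable if the evidence is ever challenged.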

Phase 2: Initial Scoping

Objective: Define investigation boundaries

Question | Purpose
What AI system(s) are involved? | Scope technical investigation
What is the incident timeline? | Focus investigation period
Who might have relevant information? | Plan interviews
What documentation exists? | Identify available evidence
What is the business impact? | Prioritize investigation depth
Are there regulatory implications? | Ensure compliance requirements are met
Is there potential litigation? | Engage legal early if needed

Scope Document

Investigation Scope Document

Incident ID: [ID]
Investigation Lead: [Name]
Date: [Date]

SCOPE
- Systems: [List AI systems in scope]
- Time period: [Start] to [End]
- Data: [Types of data in scope]
- People: [Roles/individuals to interview]

OUT OF SCOPE
- [Items explicitly excluded]

OBJECTIVES
1. Determine root cause of incident
2. Assess full impact
3. Identify remediation requirements
4. Document for regulatory/legal purposes
5. Extract lessons learned

CONSTRAINTS
- Investigation deadline: [Date]
- Resource constraints: [If any]
- Access limitations: [If any]

Phase 3: Information Gathering

Objective: Collect all relevant information

Technical Analysis

Activity | Description | Output
Log analysis | Review system, application, and security logs | Timeline, anomalies identified
Model analysis | Examine model behavior, performance metrics | Model assessment
Data analysis | Analyze inputs, outputs, and related data | Data patterns, anomalies
System analysis | Review configuration, architecture, integrations | System state documentation
Code review | Review relevant code if applicable | Code issues identified
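Log analysis usually begins by fusing entries from multiple sources into one chronological timeline. A minimal sketch, assuming ISO-8601 timestamps at the start of each line (the log format and sample lines are hypothetical):

```python
import re
from datetime import datetime

# Assumed format: "YYYY-MM-DDTHH:MM:SS <message>"
LOG_LINE = re.compile(r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\s+(?P<msg>.*)$")

def build_timeline(log_lines):
    """Parse ISO-timestamped log lines and return (timestamp, message)
    events in chronological order; unparseable lines are skipped."""
    events = []
    for line in log_lines:
        m = LOG_LINE.match(line)
        if m:
            events.append((datetime.fromisoformat(m.group("ts")), m.group("msg")))
    return sorted(events)

logs = [
    "2025-11-24T10:05:12 anomalous output flagged by monitor",
    "2025-11-24T09:58:03 config reload applied",
    "not a log line",
]
timeline = build_timeline(logs)
# timeline[0] is the 09:58 config reload, timeline[1] the 10:05 anomaly
```

In a real investigation the same merge would span application, system, and security logs, with each event tagged by its source so anomalies can be traced across systems.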

Interviews

Interviewee | Purpose | Sample Questions
First responders | Understand initial discovery and response | What did you observe? What actions did you take?
System operators | Understand normal operations and deviations | Was anything unusual before the incident?
AI/ML engineers | Technical understanding of the system | How should the system behave? What could cause this?
Business users | Business impact and context | What was the real-world effect?
Security team | Security context | Any related security events?

Document Review

  • System documentation
  • Previous incident reports
  • Change records (recent changes to the system)
  • Monitoring alerts and reports
  • Training data documentation
  • Model validation reports

Phase 4: Root Cause Analysis

Objective: Determine what caused the incident and why

Technique 1: 5 Whys

Keep asking "why" until you reach fundamental causes:

Incident: AI chatbot provided incorrect information to customers

Why? → The model generated a response containing false facts
Why? → The model was not trained on recent policy changes  
Why? → The retraining pipeline failed two months ago
Why? → Pipeline failure alerts went to a deprecated email address
Why? → Alert configuration wasn't updated during team reorganization

ROOT CAUSE: Alert configuration management process inadequate

Technique 2: Fishbone (Ishikawa) Diagram

Categorize potential causes:

Fishbone (Ishikawa) diagram: categorize potential causes across Data, Model, Process, People, and Systems.

Technique 3: Fault Tree Analysis

Work backward from the incident:

Incident (Top Event)
       │
       ├── Immediate Cause 1
       │         │
       │         ├── Contributing Factor 1a
       │         └── Contributing Factor 1b
       │
       └── Immediate Cause 2
                 │
                 ├── Contributing Factor 2a
                 └── Contributing Factor 2b

AI-Specific Root Cause Categories

Category | Examples
Data issues | Data drift, poisoned data, data quality, missing data, biased data
Model issues | Model drift, training problems, architectural limitations, overfitting
Implementation issues | Integration bugs, configuration errors, deployment problems
Operational issues | Monitoring gaps, inadequate thresholds, response failures
Governance issues | Policy gaps, unapproved changes, inadequate oversight
External factors | Adversarial attacks, changed operating environment, third-party failures

Phase 5: Impact Assessment

Objective: Understand full scope of incident impact

Impact Dimension | Assessment Questions | Quantification
People affected | How many? Who? | Count, demographics
Data compromised | What types? How sensitive? | Data classification
Financial | Direct costs? Indirect costs? | Dollar amounts
Operational | Business disruption? Duration? | Downtime, affected processes
Reputational | Public awareness? Media? | Coverage, sentiment
Regulatory | Compliance violations? Notifications? | Specific requirements triggered
Legal | Liability exposure? | Potential claims

Phase 6: Documentation

Objective: Create complete investigation record

Investigation Report Structure

AI INCIDENT INVESTIGATION REPORT

1. EXECUTIVE SUMMARY
   - Incident overview
   - Key findings
   - Root causes
   - Recommendations

2. INCIDENT DESCRIPTION
   - Timeline
   - Systems involved
   - Detection method
   - Initial response

3. INVESTIGATION METHODOLOGY
   - Scope
   - Team
   - Methods used
   - Limitations

4. FINDINGS
   - Technical findings
   - Process findings
   - People findings
   - Third-party findings

5. ROOT CAUSE ANALYSIS
   - Primary root cause
   - Contributing factors
   - Analysis methodology

6. IMPACT ASSESSMENT
   - Quantified impacts
   - Stakeholders affected
   - Regulatory implications

7. RECOMMENDATIONS
   - Immediate actions
   - Short-term improvements
   - Long-term improvements

8. LESSONS LEARNED
   - What worked well
   - What didn't work
   - Key takeaways

9. APPENDICES
   - Evidence inventory
   - Interview summaries
   - Technical analysis details
   - Timeline

AI Incident Investigation Checklist

Day 1 (Preservation)

  • Preserve model state (version, config, weights if accessible)
  • Export relevant logs
  • Capture input/output data
  • Document system state
  • Establish chain of custody
  • Identify key stakeholders

Week 1 (Core Investigation)

  • Define investigation scope
  • Conduct technical analysis
  • Complete interviews
  • Review documentation
  • Begin root cause analysis
  • Assess impact

Week 2+ (Analysis and Reporting)

  • Complete root cause analysis
  • Develop recommendations
  • Draft investigation report
  • Review with stakeholders
  • Finalize documentation
  • Transfer to post-mortem process

Common Failure Modes

1. Starting Late

Investigation starts after evidence is lost. Begin preservation immediately.

2. Too Narrow Focus

Investigating only the obvious cause while missing systemic issues. Look broadly.

3. Blame-Seeking

Investigation becomes about finding fault rather than understanding and preventing.

4. Stopping at Symptoms

Accepting surface explanations without digging to root causes.

5. Inadequate Documentation

Verbal findings that can't be referenced later. Document everything.

6. No Follow-Through

Recommendations made but never implemented. Track recommendation completion.


Metrics to Track

Metric | Target
Investigation initiation | Within 24 hours of containment
Investigation completion | Within 2-4 weeks for significant incidents
Root causes identified | At least 1 per incident
Recommendations made | At least 1 per root cause
Recommendation implementation | >80% within 90 days
Recurrence rate | <10% of same incident type within 12 months
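The recommendation-implementation target (>80% within 90 days) is straightforward to compute from a tracking list. A sketch, assuming each recommendation record carries a `made` date and an optional `implemented` date (the records below are hypothetical):

```python
from datetime import date

def implementation_rate(recommendations, window_days=90):
    """Fraction of recommendations implemented within `window_days` of
    being made; open or late recommendations count against the rate."""
    if not recommendations:
        return 0.0
    on_time = sum(
        1 for r in recommendations
        if r.get("implemented") is not None
        and (r["implemented"] - r["made"]).days <= window_days
    )
    return on_time / len(recommendations)

# Hypothetical recommendation tracking records.
recs = [
    {"made": date(2025, 1, 10), "implemented": date(2025, 2, 1)},   # on time
    {"made": date(2025, 1, 10), "implemented": date(2025, 6, 1)},   # late
    {"made": date(2025, 1, 10), "implemented": None},               # still open
    {"made": date(2025, 2, 1),  "implemented": date(2025, 3, 1)},   # on time
]
rate = implementation_rate(recs)  # 2 of 4 on time -> 0.5, below the 80% target
```

Tracking this number over time is what turns "no follow-through" from a vague worry into a measurable failure mode.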

Frequently Asked Questions

How much investigation is enough?

Proportional to impact. Minor incidents need brief investigation; major incidents need thorough analysis. When you can explain what happened, why, and what to change—you've done enough.

What if we can't determine the root cause?

Document what you do know, what remains uncertain, and why. Implement mitigations based on best understanding. Sometimes uncertainty is the answer.

Should investigation be blameless?

Focus on system improvement, not individual blame. However, if investigation reveals negligence or misconduct, that must be addressed through appropriate channels.

What if the vendor's AI caused the incident?

Engage the vendor in investigation. Your report documents what you could determine; note where vendor cooperation was needed or lacking.

How do we handle investigation while the system is still needed?

Preserve what you can without disrupting operations. Use monitoring data, logs, and configuration snapshots rather than taking systems offline if possible.


Taking Action

Thorough investigation is what separates organizations that keep having the same incidents from those that genuinely improve. The time invested in understanding what went wrong pays dividends in incidents prevented.

Build investigation capability before you need it—trained people, documented procedures, and preserved access to forensic information.

Ready to strengthen your AI incident investigation capability?

Pertama Partners helps organizations build robust AI incident investigation processes. Our AI Readiness Audit includes incident response and investigation capability assessment.

Book an AI Readiness Audit →


References

  1. NIST. (2023). Computer Security Incident Handling Guide (SP 800-61).
  2. ISACA. (2024). IT Incident Investigation.
  3. ISO/IEC 27043. Incident Investigation Principles and Processes.
  4. ENISA. (2024). Good Practice Guide for Incident Management.
  5. Reason, J. (1997). Managing the Risks of Organizational Accidents.


Michael Lansdowne Hauge

Founder & Managing Partner

Founder & Managing Partner at Pertama Partners. Founder of Pertama Group.

