The immediate crisis is contained. Now comes the harder work: figuring out what actually happened, why it happened, and what needs to change so it doesn't happen again.
AI incident investigation differs from traditional IT forensics: AI systems are less deterministic than conventional software, their failures are often subtle, and understanding what went wrong may require specialized expertise. This guide provides a structured methodology for investigating AI incidents thoroughly.
Executive Summary
- AI investigation has unique challenges: Non-deterministic systems, complex causation, black-box behavior
- Evidence preservation is critical: AI state, inputs, outputs, and logs must be captured before they're lost
- Root cause analysis requires AI expertise: Understanding model failures needs specialized knowledge
- Investigation scope must balance depth with speed: Don't delay remediation for perfect understanding
- Third-party AI complicates investigation: Vendor access and cooperation may be needed
- Documentation serves multiple purposes: Regulatory compliance, legal protection, organizational learning
- Investigation feeds improvement: The goal isn't blame but prevention
Why This Matters Now
Investigations often get short-changed. Once an incident is contained, there's pressure to move on. But inadequate investigation leads to:
- Recurrence: The same incident happens again because root cause wasn't addressed
- Regulatory problems: Authorities expect thorough investigation and documentation
- Legal exposure: Inadequate investigation makes defense harder if litigation occurs
- Lost learning: The organization doesn't improve its AI practices
- Hidden problems: Related issues go undetected
Thorough investigation is an investment, not a cost.
AI Investigation Challenges
Challenge 1: Non-Determinism
AI systems can produce different outputs from the same inputs. Reproducing the exact failure conditions may be impossible.
Approach: Document the statistical behavior, not just individual outputs. Look for patterns across multiple instances.
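One practical way to document statistical rather than single-shot behavior is to replay the incident-triggering input many times and record the distribution of outputs. The sketch below is illustrative only; `call_model` is a hypothetical stand-in for however your system invokes the model.

```python
from collections import Counter

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for the production inference call."""
    raise NotImplementedError("Replace with your actual model invocation")

def sample_output_distribution(prompt: str, runs: int = 50) -> Counter:
    """Replay the same input repeatedly and tally distinct outputs.

    For a non-deterministic system, this distribution (not any single
    output) is what belongs in the evidence record.
    """
    outputs = Counter()
    for _ in range(runs):
        outputs[call_model(prompt)] += 1
    return outputs

# Example usage during investigation:
# dist = sample_output_distribution(incident_prompt, runs=100)
# print(dist.most_common(5))  # how often does the failing output recur?
```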
Challenge 2: Black Box Behavior
Many AI models can't explain why they produced specific outputs. The internal reasoning is opaque.
Approach: Use explainability techniques where available. Focus on what conditions correlate with failures, even if causation is unclear.
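Where model internals are opaque, slicing failure rates by observable input attributes can at least show which conditions correlate with bad outputs. A minimal sketch, assuming you have incident-period records with an observable attribute and a reviewer-assigned failure flag (both field names are hypothetical):

```python
from collections import defaultdict

def failure_rate_by_slice(records, slice_key="input_language", flag_key="is_failure"):
    """Group records by an observable attribute and compute the failure rate per group.

    A slice with an elevated failure rate is a lead for investigation,
    not proof of causation.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for rec in records:
        key = rec.get(slice_key, "unknown")
        totals[key] += 1
        failures[key] += int(bool(rec.get(flag_key)))
    return {key: failures[key] / totals[key] for key in totals}

# Example usage:
# records = [{"input_language": "en", "is_failure": False}, ...]
# print(failure_rate_by_slice(records))
```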
Challenge 3: Complex Causation
AI failures often result from multiple interacting factors—data, model, implementation, context—not a single root cause.
Approach: Use multiple root cause analysis techniques. Accept that causation may be multifactorial.
Challenge 4: Temporal State
The model's behavior may have changed since the incident—through drift, updates, or retraining.
Approach: Preserve model state immediately. Document version information. Compare current state to incident-time state.
Challenge 5: Third-Party Systems
If the AI is vendor-provided, you may lack access to investigate internal behavior.
Approach: Engage vendors early. Contractual provisions for incident cooperation are essential. Focus on what you can observe.
AI Incident Investigation Process
Phase 1: Evidence Preservation
Objective: Secure evidence before it's lost or altered
Timing: Immediately upon incident detection, in parallel with containment
| Evidence Type | What to Preserve | How to Preserve |
|---|---|---|
| Model state | Model version, weights (if available), configuration | Snapshot, documentation |
| Input data | Inputs that triggered the incident | Copy to secure location |
| Output data | Outputs produced during incident | Export and secure |
| System logs | Application, system, security logs | Export with timestamps |
| Access logs | Who accessed what when | Export and secure |
| Configuration | System settings at time of incident | Snapshot |
| Metrics data | Performance metrics, monitoring data | Export from monitoring systems |
| Related data | Training data, feature data, context | Secure if relevant |
Evidence Chain of Custody
Document for each piece of evidence:
- What was collected
- When it was collected
- Who collected it
- Where it's stored
- Integrity verification (hashes)
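Integrity verification can be as simple as hashing each collected artifact and recording who collected it, when, and where it is stored. A minimal sketch (file paths, collector names, and the manifest location are placeholders):

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute a SHA-256 digest of an evidence file for integrity verification."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def record_evidence(path: Path, collected_by: str, manifest: Path) -> dict:
    """Append a chain-of-custody entry: what, when, who, where, and its hash."""
    entry = {
        "what": path.name,
        "where": str(path.resolve()),
        "when": datetime.now(timezone.utc).isoformat(),
        "who": collected_by,
        "sha256": sha256_of(path),
    }
    with manifest.open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

# Example usage:
# record_evidence(Path("incident_logs.tar.gz"), "j.tan", Path("custody_manifest.jsonl"))
```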
Phase 2: Initial Scoping
Objective: Define investigation boundaries
| Question | Purpose |
|---|---|
| What AI system(s) are involved? | Scope technical investigation |
| What is the incident timeline? | Focus investigation period |
| Who might have relevant information? | Plan interviews |
| What documentation exists? | Identify available evidence |
| What is the business impact? | Prioritize investigation depth |
| Are there regulatory implications? | Ensure compliance requirements met |
| Is there potential litigation? | Engage legal early if needed |
Scope Document
Investigation Scope Document
Incident ID: [ID]
Investigation Lead: [Name]
Date: [Date]
SCOPE
- Systems: [List AI systems in scope]
- Time period: [Start] to [End]
- Data: [Types of data in scope]
- People: [Roles/individuals to interview]
OUT OF SCOPE
- [Items explicitly excluded]
OBJECTIVES
1. Determine root cause of incident
2. Assess full impact
3. Identify remediation requirements
4. Document for regulatory/legal purposes
5. Extract lessons learned
CONSTRAINTS
- Investigation deadline: [Date]
- Resource constraints: [If any]
- Access limitations: [If any]
Phase 3: Information Gathering
Objective: Collect all relevant information
Technical Analysis
| Activity | Description | Output |
|---|---|---|
| Log analysis | Review system, application, and security logs | Timeline, anomalies identified |
| Model analysis | Examine model behavior, performance metrics | Model assessment |
| Data analysis | Analyze inputs, outputs, and related data | Data patterns, anomalies |
| System analysis | Review configuration, architecture, integrations | System state documentation |
| Code review | Review relevant code if applicable | Code issues identified |
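A common first step in the technical analysis is merging application, system, and security logs into a single chronological timeline. A minimal sketch, assuming each log line begins with an ISO-8601 timestamp (adapt the parsing to your actual log formats):

```python
from datetime import datetime
from pathlib import Path

def parse_line(source: str, line: str):
    """Parse 'TIMESTAMP rest-of-message' into (timestamp, source, message)."""
    ts_str, _, message = line.partition(" ")
    try:
        ts = datetime.fromisoformat(ts_str)
    except ValueError:
        return None  # skip lines that don't match the assumed format
    return ts, source, message.strip()

def merged_timeline(log_files: dict[str, Path]):
    """Combine multiple log files into one chronologically ordered timeline."""
    events = []
    for source, path in log_files.items():
        for line in path.read_text().splitlines():
            parsed = parse_line(source, line)
            if parsed:
                events.append(parsed)
    return sorted(events, key=lambda event: event[0])

# Example usage:
# timeline = merged_timeline({"app": Path("app.log"), "security": Path("audit.log")})
# for ts, source, msg in timeline:
#     print(ts.isoformat(), source, msg)
```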
Interviews
| Interviewee | Purpose | Sample Questions |
|---|---|---|
| First responders | Understand initial discovery and response | What did you observe? What actions did you take? |
| System operators | Understand normal operations and deviations | Was anything unusual before the incident? |
| AI/ML engineers | Technical understanding of the system | How should the system behave? What could cause this? |
| Business users | Business impact and context | What was the real-world effect? |
| Security team | Security context | Any related security events? |
Document Review
- System documentation
- Previous incident reports
- Change records (recent changes to the system)
- Monitoring alerts and reports
- Training data documentation
- Model validation reports
Phase 4: Root Cause Analysis
Objective: Determine what caused the incident and why
Technique 1: 5 Whys
Keep asking "why" until you reach fundamental causes:
Incident: AI chatbot provided incorrect information to customers
Why? → The model generated a response containing false facts
Why? → The model was not trained on recent policy changes
Why? → The retraining pipeline failed two months ago
Why? → Pipeline failure alerts went to a deprecated email address
Why? → Alert configuration wasn't updated during team reorganization
ROOT CAUSE: Alert configuration management process inadequate
Technique 2: Fishbone (Ishikawa) Diagram
Categorize potential causes across Data, Model, Process, People, and Systems, drawing each as a branch of the fishbone (Ishikawa) diagram that feeds into the incident.
Technique 3: Fault Tree Analysis
Work backward from the incident:
Incident (Top Event)
│
├── Immediate Cause 1
│   ├── Contributing Factor 1a
│   └── Contributing Factor 1b
│
└── Immediate Cause 2
    ├── Contributing Factor 2a
    └── Contributing Factor 2b
AI-Specific Root Cause Categories
| Category | Examples |
|---|---|
| Data issues | Data drift, poisoned data, data quality, missing data, biased data |
| Model issues | Model drift, training problems, architectural limitations, overfitting |
| Implementation issues | Integration bugs, configuration errors, deployment problems |
| Operational issues | Monitoring gaps, inadequate thresholds, response failures |
| Governance issues | Policy gaps, unapproved changes, inadequate oversight |
| External factors | Adversarial attacks, changed operating environment, third-party failures |
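Several of these categories can be checked empirically rather than argued about. For example, a quick data-drift screen compares a numeric input feature's distribution during the incident window against a pre-incident baseline. The sketch below uses a simple mean-shift heuristic; the baseline window, threshold, and feature choice are assumptions to adapt to your system:

```python
from statistics import mean, stdev

def drift_score(baseline: list[float], incident: list[float]) -> float:
    """Standardised shift in a feature's mean between two time windows.

    A large score suggests incident-period inputs differ from the baseline
    and that data drift is worth pursuing as a candidate root cause.
    """
    base_mean, base_std = mean(baseline), stdev(baseline)
    if base_std == 0:
        return float("inf") if mean(incident) != base_mean else 0.0
    return abs(mean(incident) - base_mean) / base_std

# Example usage (a threshold of 3 standard deviations is an illustrative choice):
# score = drift_score(baseline_values, incident_values)
# if score > 3:
#     print("Feature shifted significantly; investigate data drift")
```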
Phase 5: Impact Assessment
Objective: Understand full scope of incident impact
| Impact Dimension | Assessment Questions | Quantification |
|---|---|---|
| People affected | How many? Who? | Count, demographics |
| Data compromised | What types? How sensitive? | Data classification |
| Financial | Direct costs? Indirect costs? | Dollar amounts |
| Operational | Business disruption? Duration? | Downtime, affected processes |
| Reputational | Public awareness? Media? | Coverage, sentiment |
| Regulatory | Compliance violations? Notifications? | Specific requirements triggered |
| Legal | Liability exposure? | Potential claims |
Phase 6: Documentation
Objective: Create complete investigation record
Investigation Report Structure
AI INCIDENT INVESTIGATION REPORT
1. EXECUTIVE SUMMARY
- Incident overview
- Key findings
- Root causes
- Recommendations
2. INCIDENT DESCRIPTION
- Timeline
- Systems involved
- Detection method
- Initial response
3. INVESTIGATION METHODOLOGY
- Scope
- Team
- Methods used
- Limitations
4. FINDINGS
- Technical findings
- Process findings
- People findings
- Third-party findings
5. ROOT CAUSE ANALYSIS
- Primary root cause
- Contributing factors
- Analysis methodology
6. IMPACT ASSESSMENT
- Quantified impacts
- Stakeholders affected
- Regulatory implications
7. RECOMMENDATIONS
- Immediate actions
- Short-term improvements
- Long-term improvements
8. LESSONS LEARNED
- What worked well
- What didn't work
- Key takeaways
9. APPENDICES
- Evidence inventory
- Interview summaries
- Technical analysis details
- Timeline
AI Incident Investigation Checklist
Day 1 (Preservation)
- Preserve model state (version, config, weights if accessible)
- Export relevant logs
- Capture input/output data
- Document system state
- Establish chain of custody
- Identify key stakeholders
Week 1 (Core Investigation)
- Define investigation scope
- Conduct technical analysis
- Complete interviews
- Review documentation
- Begin root cause analysis
- Assess impact
Week 2+ (Analysis and Reporting)
- Complete root cause analysis
- Develop recommendations
- Draft investigation report
- Review with stakeholders
- Finalize documentation
- Transfer to post-mortem process
Common Failure Modes
1. Starting Late
Investigation starts after evidence is lost. Begin preservation immediately.
2. Too Narrow Focus
Investigating only the obvious cause while missing systemic issues. Look broadly.
3. Blame-Seeking
The investigation becomes about assigning fault rather than understanding the failure and preventing recurrence.
4. Stopping at Symptoms
Accepting surface explanations without digging to root causes.
5. Inadequate Documentation
Verbal findings that can't be referenced later. Document everything.
6. No Follow-Through
Recommendations made but never implemented. Track recommendation completion.
Metrics to Track
| Metric | Target |
|---|---|
| Investigation initiation | Within 24 hours of containment |
| Investigation completion | Within 2-4 weeks for significant incidents |
| Root causes identified | At least 1 per incident |
| Recommendations made | At least 1 per root cause |
| Recommendation implementation | >80% within 90 days |
| Recurrence rate | <10% of same incident type within 12 months |
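These metrics can be computed directly from an incident register. A minimal sketch for the recommendation-implementation metric, assuming each incident record carries a list of recommendations with the dates they were made and completed (all field names are hypothetical):

```python
from datetime import date

def recommendation_implementation_rate(incidents: list[dict], within_days: int = 90) -> float:
    """Share of recommendations implemented within the target window."""
    total, on_time = 0, 0
    for incident in incidents:
        for rec in incident.get("recommendations", []):
            total += 1
            made, done = rec.get("made_on"), rec.get("completed_on")
            if made and done and (done - made).days <= within_days:
                on_time += 1
    return on_time / total if total else 0.0

# Example usage:
# incidents = [{"recommendations": [{"made_on": date(2024, 1, 5),
#                                    "completed_on": date(2024, 2, 1)}]}]
# print(recommendation_implementation_rate(incidents))  # 1.0
```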
Frequently Asked Questions
How much investigation is enough?
Proportional to impact. Minor incidents need brief investigation; major incidents need thorough analysis. When you can explain what happened, why it happened, and what needs to change, you have investigated enough.
What if we can't determine the root cause?
Document what you do know, what remains uncertain, and why. Implement mitigations based on best understanding. Sometimes uncertainty is the answer.
Should investigation be blameless?
Focus on system improvement, not individual blame. However, if investigation reveals negligence or misconduct, that must be addressed through appropriate channels.
What if the vendor's AI caused the incident?
Engage the vendor in investigation. Your report documents what you could determine; note where vendor cooperation was needed or lacking.
How do we handle investigation while the system is still needed?
Preserve what you can without disrupting operations. Use monitoring data, logs, and configuration snapshots rather than taking systems offline if possible.
Taking Action
Thorough investigation is what separates organizations that keep having the same incidents from those that genuinely improve. The time invested in understanding what went wrong pays dividends in incidents prevented.
Build investigation capability before you need it—trained people, documented procedures, and preserved access to forensic information.
Ready to strengthen your AI incident investigation capability?
Pertama Partners helps organizations build robust AI incident investigation processes. Our AI Readiness Audit includes incident response and investigation capability assessment.
References
- NIST. (2023). Computer Security Incident Handling Guide (SP 800-61).
- ISACA. (2024). IT Incident Investigation.
- ISO/IEC 27043. Incident Investigation Principles and Processes.
- ENISA. (2024). Good Practice Guide for Incident Management.
- Reason, J. (1997). Managing the Risks of Organizational Accidents.