
Most companies evaluate AI along a single dimension: ROI (how much money does it save?), risk (what could go wrong?), or quality (does it produce good outputs?). Evaluating on only one dimension leads to poor decisions.
This framework evaluates AI initiatives across all three dimensions simultaneously, giving leadership a balanced view for decision-making.
Quality measures how well the AI system performs its intended function. This includes output accuracy, consistency, reliability, and fitness for purpose.
Risk measures the potential negative consequences of AI use, including data privacy exposure, regulatory non-compliance, bias, security vulnerabilities, and operational dependencies.
ROI measures the business value delivered by the AI system relative to its cost. This includes time savings, cost reduction, revenue impact, and strategic value.
Quality metrics:
| Metric | Description | How to Measure |
|---|---|---|
| Accuracy | Percentage of AI outputs that are factually correct | Sample 50+ outputs, verify against ground truth |
| Consistency | Same input produces similar quality output | Run identical prompts 10 times, compare variation |
| Completeness | Outputs contain all required information | Review against task requirements checklist |
| Relevance | Outputs address the actual question/task | Expert review of sample outputs |
| Usability | Outputs can be used with minimal editing | Measure edit time before output is usable |
| Latency | Time from input to output | Automated measurement |
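The consistency check lends itself to light automation. Below is a minimal sketch, assuming the outputs from roughly ten identical prompt runs have already been collected as strings; the pairwise text-similarity metric and the example outputs are illustrative stand-ins, not part of the framework.

```python
from difflib import SequenceMatcher
from itertools import combinations

def consistency_score(outputs: list[str]) -> float:
    """Mean pairwise similarity (0.0-1.0) across repeated runs of one prompt."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0  # zero or one output: nothing to compare
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

# Example: three runs of the same summarisation prompt
runs = [
    "Invoice total is 4,200, payment due 30 days from issue.",
    "The invoice total is 4,200 and payment is due within 30 days.",
    "Total: 4,200. Payment terms: 30 days.",
]
print(f"Consistency: {consistency_score(runs):.2f}")
```

A raw text-similarity ratio only flags gross drift between runs; for structured outputs, compare the extracted fields instead.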
Quality scoring rubric:
| Score | Rating | Description |
|---|---|---|
| 5 | Excellent | >95% accuracy, minimal editing needed, fast and consistent |
| 4 | Good | 85-95% accuracy, light editing, generally reliable |
| 3 | Acceptable | 70-85% accuracy, moderate editing, some inconsistency |
| 2 | Poor | 50-70% accuracy, significant editing, unreliable |
| 1 | Unacceptable | <50% accuracy, outputs frequently wrong or unusable |
Measure these metrics in two phases: pre-deployment testing (to establish a baseline before launch) and ongoing monitoring (after deployment, as part of the quarterly reviews described later in this framework).
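Where accuracy has already been measured on a sample of outputs, the 1-5 rating can be derived mechanically. A minimal sketch using only the accuracy bands from the rubric above; the editing-effort and consistency criteria still need reviewer judgement:

```python
def quality_score(accuracy: float) -> int:
    """Map measured accuracy (0.0-1.0) onto the 1-5 quality rating bands."""
    if accuracy > 0.95:
        return 5  # Excellent
    if accuracy >= 0.85:
        return 4  # Good
    if accuracy >= 0.70:
        return 3  # Acceptable
    if accuracy >= 0.50:
        return 2  # Poor
    return 1      # Unacceptable

# Example: 46 of 50 sampled outputs verified correct against ground truth
print(quality_score(46 / 50))  # 0.92 -> 4 (Good)
```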
Risk assessment categories:
| Category | Key Questions | Severity |
|---|---|---|
| Data privacy | Does it process personal data? Where is data stored? Is data used for training? | High |
| Regulatory compliance | Does use comply with PDPA, MAS, BNM, and industry regulations? | High |
| Bias and fairness | Could outputs discriminate against protected groups? | High |
| Security | Is the tool properly secured? Are there vulnerabilities? | High |
| Accuracy risk | What happens if the output is wrong? What is the downstream impact? | Medium-High |
| Vendor dependency | What happens if the vendor shuts down or changes terms? | Medium |
| Reputational | Could AI use damage the company's reputation with clients or public? | Medium |
| IP and copyright | Are there intellectual property risks with AI-generated content? | Medium |
Use the risk scoring matrix from the AI Risk Assessment Template: rate each category for likelihood (1-5) and impact (1-5), then multiply the two for a category score out of 25.
Aggregate risk rating: Low (1-8), Medium (9-15), High (16-25).
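A minimal sketch of the scoring arithmetic, assuming the template's matrix rates each category for likelihood (1-5) and impact (1-5), multiplies the two, and takes the worst category score as the aggregate; the category names and example ratings are illustrative:

```python
RISK_BANDS = [(8, "Low"), (15, "Medium"), (25, "High")]

def risk_score(likelihood: int, impact: int) -> int:
    """Score one risk category: likelihood (1-5) x impact (1-5) = 1-25."""
    return likelihood * impact

def aggregate_risk(category_scores: dict[str, int]) -> tuple[int, str]:
    """Take the highest (worst) category score and map it to a rating band."""
    worst = max(category_scores.values())
    band = next(label for limit, label in RISK_BANDS if worst <= limit)
    return worst, band

# Example ratings for three of the categories above
scores = {
    "Data privacy": risk_score(likelihood=3, impact=4),
    "Regulatory compliance": risk_score(likelihood=2, impact=5),
    "Vendor dependency": risk_score(likelihood=3, impact=2),
}
print(aggregate_risk(scores))  # (12, 'Medium')
```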
Direct cost savings:
| Savings Category | Calculation |
|---|---|
| Time saved | (Hours saved per week × hourly cost × 52 weeks) |
| Headcount avoided | (FTE equivalent × annual fully-loaded cost) |
| Error reduction | (Errors avoided × average cost per error) |
| Outsourcing reduced | (Outsourced work replaced × annual outsourcing cost) |
Revenue impact:
| Revenue Category | Calculation |
|---|---|
| Faster time to market | (Days saved × daily revenue opportunity) |
| Improved conversion | (Conversion improvement × revenue per customer) |
| Customer retention | (Churn reduction × lifetime customer value) |
| New capabilities | (New revenue enabled × projected annual revenue) |
Total cost of ownership:
| Cost Category | Calculation |
|---|---|
| Software licences | (Per user cost × number of users × 12 months) |
| Implementation | (Setup, configuration, integration hours × hourly rate) |
| Training | (Training cost per person × number of people) |
| Ongoing support | (Support hours per month × hourly rate × 12) |
| Governance overhead | (Governance time per month × hourly rate × 12) |
Annual Net Benefit = (Direct Cost Savings + Revenue Impact) - Total Cost of Ownership
ROI Percentage = (Annual Net Benefit / Total Cost of Ownership) × 100
Payback Period (months) = Total Cost of Ownership / Monthly Net Benefit
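A minimal sketch of the three formulas, with the totals from the savings, revenue, and cost tables above passed in as plain numbers; the figures in the example are illustrative, not benchmarks:

```python
def roi_summary(direct_cost_savings: float,
                revenue_impact: float,
                total_cost_of_ownership: float) -> dict:
    """Annual net benefit, ROI percentage and payback period (months)."""
    annual_net_benefit = (direct_cost_savings + revenue_impact) - total_cost_of_ownership
    roi_pct = annual_net_benefit / total_cost_of_ownership * 100
    payback_months = total_cost_of_ownership / (annual_net_benefit / 12)
    return {
        "annual_net_benefit": annual_net_benefit,
        "roi_pct": round(roi_pct, 1),
        "payback_months": round(payback_months, 1),
    }

# Illustrative totals: sum each of the three tables first
print(roi_summary(direct_cost_savings=120_000,
                  revenue_impact=30_000,
                  total_cost_of_ownership=60_000))
# {'annual_net_benefit': 90000, 'roi_pct': 150.0, 'payback_months': 8.0}
```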
ROI scoring rubric:
| Score | ROI Rating | Description |
|---|---|---|
| 5 | Exceptional | ROI > 300%, payback < 3 months |
| 4 | Strong | ROI 150-300%, payback 3-6 months |
| 3 | Positive | ROI 50-150%, payback 6-12 months |
| 2 | Marginal | ROI 0-50%, payback 12-18 months |
| 1 | Negative | ROI < 0% or payback > 18 months |
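Mapping the computed ROI percentage and payback period onto the 1-5 rating can also be scripted. Where the two measures fall in different bands, the sketch below takes the lower score; that tie-break is an assumption, since the rubric does not state one:

```python
def roi_rating(roi_pct: float, payback_months: float) -> int:
    """Map ROI percentage and payback period (months) onto the 1-5 ROI bands."""
    if roi_pct < 0 or payback_months > 18:
        return 1  # Negative
    by_roi = 5 if roi_pct > 300 else 4 if roi_pct >= 150 else 3 if roi_pct >= 50 else 2
    by_payback = (5 if payback_months < 3 else
                  4 if payback_months <= 6 else
                  3 if payback_months <= 12 else 2)
    return min(by_roi, by_payback)  # conservative: weaker of the two signals

print(roi_rating(roi_pct=150.0, payback_months=8.0))  # 3 (Positive)
```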
Score each AI initiative across all three dimensions and record the results in a combined evaluation matrix:
| AI Initiative | Quality (1-5) | Risk (1-25, lower is better) | ROI (1-5) | Overall Recommendation |
|---|---|---|---|---|
| [Initiative 1] | [Score] | [Score] | [Score] | [Proceed / Caution / Stop] |
| [Initiative 2] | [Score] | [Score] | [Score] | [Proceed / Caution / Stop] |
Decision matrix:
| Quality | Risk | ROI | Recommendation |
|---|---|---|---|
| 4-5 | Low (1-8) | 4-5 | Proceed — scale aggressively |
| 4-5 | Low (1-8) | 2-3 | Proceed — monitor ROI closely |
| 3-5 | Medium (9-15) | 3-5 | Proceed with caution — implement risk mitigations |
| Any | High (16-25) | Any | Stop — address risk before proceeding |
| 1-2 | Any | Any | Stop — quality is insufficient |
| 3-5 | Low (1-8) | 1 | Reconsider — explore alternatives with better ROI |
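The decision rules can be expressed as a single function when filling in the combined matrix. A sketch that evaluates the stop rules first; the precedence order, and the catch-all for combinations the table leaves open, are assumptions:

```python
def recommendation(quality: int, risk: int, roi: int) -> str:
    """Decision matrix: quality and ROI are 1-5 scores, risk is 1-25 (lower is better)."""
    low, medium = risk <= 8, 9 <= risk <= 15
    if risk >= 16:
        return "Stop - address risk before proceeding"
    if quality <= 2:
        return "Stop - quality is insufficient"
    if quality >= 4 and low and roi >= 4:
        return "Proceed - scale aggressively"
    if quality >= 4 and low and roi >= 2:
        return "Proceed - monitor ROI closely"
    if quality >= 3 and medium and roi >= 3:
        return "Proceed with caution - implement risk mitigations"
    if quality >= 3 and low and roi == 1:
        return "Reconsider - explore alternatives with better ROI"
    return "Not covered by the matrix - escalate for a judgement call"

# Example: quality 4, aggregate risk 12 (Medium), ROI score 3
print(recommendation(quality=4, risk=12, roi=3))
# Proceed with caution - implement risk mitigations
```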
Before deploying an AI initiative, establish baseline measurements for quality, risk, and cost metrics.
After a pilot period (typically 4-8 weeks), conduct a full evaluation using this framework.
For deployed AI initiatives, conduct evaluations quarterly or when significant changes occur.
Present the combined evaluation matrix to leadership quarterly, covering all active AI initiatives.
AI ROI is calculated as: (Annual Direct Cost Savings + Revenue Impact - Total Cost of Ownership) / Total Cost of Ownership × 100. Key components include time saved, headcount avoided, error reduction, licence costs, implementation costs, and training costs. Most companies see 100-300% ROI on well-targeted AI initiatives.
For most business applications, a quality score of 4 (Good: 85-95% accuracy, light editing needed) is the minimum for production use. A score of 3 (Acceptable: 70-85% accuracy) may be sufficient for internal drafts that will be heavily reviewed. Scores below 3 indicate the AI tool is not suitable for that use case.
AI initiatives should be evaluated at three stages: pre-deployment (before launch), post-pilot (after 4-8 weeks), and ongoing (quarterly). Additionally, re-evaluate whenever there is a significant vendor update, a change in use case scope, an incident, or a change in regulatory requirements.