Using AI to Evaluate AI
One of the most powerful prompt engineering techniques is using AI to evaluate its own outputs. This creates a quality loop: generate output → evaluate it → improve it → evaluate again.
Self-Critique Prompting
Basic Self-Critique
After generating any output, follow up with:
Review what you just wrote. Identify:
- The 3 weakest points or claims
- Any statements that might be inaccurate
- Where the reasoning could be stronger
- What is missing that should be included
Then rewrite the output addressing these issues.
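If you drive the model through an API rather than a chat window, this follow-up can be automated as a second turn in the same conversation. Below is a minimal sketch, assuming the OpenAI Python SDK; the model name is illustrative and any chat-capable provider works the same way.

```python
# Sketch: a two-turn self-critique pass, assuming the OpenAI Python SDK.
# The model name is an illustrative assumption; any chat API works the same way.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # assumption: substitute your preferred model

CRITIQUE_PROMPT = (
    "Review what you just wrote. Identify:\n"
    "- The 3 weakest points or claims\n"
    "- Any statements that might be inaccurate\n"
    "- Where the reasoning could be stronger\n"
    "- What is missing that should be included\n"
    "Then rewrite the output addressing these issues."
)

def self_critique(task: str) -> str:
    """Generate a draft, then ask the same model to critique and rewrite it."""
    messages = [{"role": "user", "content": task}]
    draft = client.chat.completions.create(model=MODEL, messages=messages)
    messages.append({"role": "assistant", "content": draft.choices[0].message.content})
    messages.append({"role": "user", "content": CRITIQUE_PROMPT})
    revised = client.chat.completions.create(model=MODEL, messages=messages)
    return revised.choices[0].message.content
```

Keeping the first draft in the message history matters: the critique is only as good as the model's view of what it is critiquing.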
Expert Critique
Now review this output as if you were a [specific expert]:
- A sceptical CFO reviewing a business case
- An employment lawyer reviewing an HR policy
- A customer reading a sales proposal
What would they find unconvincing, unclear, or missing?
Red Team Analysis
Act as a critic who strongly disagrees with this analysis. What are the 5 strongest counter-arguments? Which claims are most vulnerable to challenge? Where is the evidence weakest?
Quality Scoring
Output Quality Rubric
Score this output on a 1-5 scale for each criterion:
- Accuracy — Are all facts and claims correct?
- Completeness — Does it address all aspects of the original request?
- Clarity — Is it easy to understand for the target audience?
- Actionability — Can the reader act on this immediately?
- Professionalism — Is the tone and format business-appropriate?
For each score below 4, explain what would need to change to earn a 5.
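If you run this rubric repeatedly, for example across a batch of drafts, it helps to request the scores as structured data so they can be compared or logged. A minimal sketch, assuming the OpenAI Python SDK; the model name, exact rubric wording, and JSON shape are all illustrative assumptions.

```python
# Sketch: rubric scoring returned as JSON, assuming the OpenAI Python SDK.
# The model name and the requested JSON shape are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

RUBRIC_PROMPT = """Score this output on a 1-5 scale for each criterion:
accuracy, completeness, clarity, actionability, professionalism.
For each score below 4, explain what would need to change to earn a 5.
Respond with a JSON object: {{"scores": {{"accuracy": 1}}, "fixes": ["..."]}}.

Output to score:
{output}"""

def score_output(output: str) -> dict:
    """Grade an output against the rubric and return the parsed scores and fixes."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: substitute your preferred model
        messages=[{"role": "user", "content": RUBRIC_PROMPT.format(output=output)}],
        response_format={"type": "json_object"},  # nudges the model to return valid JSON
    )
    return json.loads(response.choices[0].message.content)
```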
Comparative Quality Assessment
Compare these two versions of [document type]. For each criterion below, indicate which version is better and why:
- Clarity of main message
- Strength of supporting evidence
- Appropriateness of tone
- Logical structure
- Actionability of recommendations
Overall recommendation: which version to use and what improvements to make.
A/B Testing Prompts
Generate and Compare
Write two versions of this [email/proposal/report]:
- Version A: Formal, data-driven, conservative
- Version B: Conversational, story-driven, bold
Then evaluate both against these criteria: [list] and recommend which to use for [specific audience].
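The same pattern can be scripted: generate the two styled versions with separate calls, then pass both back for a comparative review. A minimal sketch assuming the OpenAI Python SDK; the model name and style instructions are illustrative.

```python
# Sketch: generate two styled versions of the same piece, then a comparative review.
# Assumes the OpenAI Python SDK; the model name and criteria are illustrative.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # assumption: substitute your preferred model

def ask(prompt: str) -> str:
    """Single-turn helper around the chat completions endpoint."""
    response = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

def ab_test(task: str, criteria: str, audience: str) -> str:
    """Produce Version A and Version B, then ask for a recommendation between them."""
    version_a = ask(f"{task}\nStyle: formal, data-driven, conservative.")
    version_b = ask(f"{task}\nStyle: conversational, story-driven, bold.")
    return ask(
        f"Evaluate both versions against these criteria: {criteria}.\n"
        f"Recommend which to use for {audience} and why.\n\n"
        f"Version A:\n{version_a}\n\nVersion B:\n{version_b}"
    )
```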
Audience Testing
Read this [communication] from the perspective of each audience:
- A CEO (cares about: strategy, ROI, risk)
- A department manager (cares about: implementation, resources, timeline)
- A frontline employee (cares about: job impact, training, support)
For each perspective: what works well, what concerns would they have, and what changes would make it more effective for them.
Systematic Evaluation Frameworks
The GRADE Framework
Evaluate evidence quality in AI outputs:
- G — Generalisability: Does this apply to our specific context (industry, country, company size)?
- R — Recency: Is this based on current information or outdated data?
- A — Accuracy: Can the key claims be verified?
- D — Depth: Is the analysis superficial or thorough?
- E — Evidence: Are sources cited and verifiable?
The CLEAR Framework
Evaluate communication quality:
- C — Concise: Is every word necessary?
- L — Logical: Does the argument flow?
- E — Evidence-based: Are claims supported?
- A — Actionable: Can the reader act on this?
- R — Relevant: Is everything pertinent to the audience?
Testing for Hallucinations
Fact-Check Prompt
Review this output and identify any claims that might be fabricated or inaccurate. For each claim:
- Quote the specific text
- Assess confidence: definitely true / probably true / uncertain / probably false / definitely false
- Explain your reasoning
- Suggest how to verify
Source Verification
List every statistic, study, or source mentioned in this output. For each:
- Quote the reference
- Can this be verified through a real source?
- If uncertain, flag it with [NEEDS VERIFICATION]
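Both checks can be run as a dedicated verification pass over any output before it leaves your hands. A minimal sketch, assuming the OpenAI Python SDK; the model name and prompt wording are illustrative. Anything the model flags or rates as uncertain still needs a human check against a primary source, since the model is only reporting its own confidence.

```python
# Sketch: a verification pass that lists every claimed source and flags uncertain ones.
# Assumes the OpenAI Python SDK; the model name and flag wording are illustrative.
from openai import OpenAI

client = OpenAI()

VERIFY_PROMPT = """List every statistic, study, or source mentioned in the output below.
For each: quote the reference, say whether it can be verified through a real source,
and flag anything uncertain with [NEEDS VERIFICATION].

Output:
{output}"""

def flag_unverified_claims(output: str) -> str:
    """Run the source-verification prompt over an output and return the flagged list."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: substitute your preferred model
        messages=[{"role": "user", "content": VERIFY_PROMPT.format(output=output)}],
    )
    return response.choices[0].message.content
```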
Iterative Improvement Loop
The most effective evaluation process:
1. Generate initial output
2. Self-critique — ask AI to identify weaknesses
3. Score — apply a quality rubric
4. Revise — address identified issues
5. Expert review — evaluate from the target audience's perspective
6. Final polish — adjust tone, format, and emphasis
This loop typically produces publication-quality output in 3-4 rounds.
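For teams automating this, the first four steps can be expressed directly in code: generate, critique, score against the rubric, and revise until every score clears a bar or a round limit is reached. The sketch below assumes the OpenAI Python SDK, an illustrative model name, and a simple stopping rule; the expert-review and final-polish steps would be added as further prompts or a human pass.

```python
# Sketch of the automated loop: generate, critique, score, revise, until every rubric
# score clears the target or the round limit is hit. Assumes the OpenAI Python SDK;
# the model name and the stopping rule are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # assumption: substitute your preferred model

def ask(prompt: str, json_mode: bool = False) -> str:
    """Single-turn helper; json_mode asks the API to return a valid JSON object."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"} if json_mode else {"type": "text"},
    )
    return response.choices[0].message.content

def improve(task: str, audience: str, max_rounds: int = 4, target: int = 4) -> str:
    """Generate a draft, then critique, score, and revise until the rubric is met."""
    draft = ask(task)
    for _ in range(max_rounds):
        critique = ask(
            f"Identify the weakest points, possible inaccuracies, and gaps in this draft:\n\n{draft}"
        )
        scores = json.loads(ask(
            "Score the draft 1-5 on accuracy, completeness, clarity, actionability "
            f"and professionalism. Reply with a JSON object of criterion: score pairs.\n\n{draft}",
            json_mode=True,
        ))
        if all(score >= target for score in scores.values()):
            break
        draft = ask(
            f"Revise the draft for a {audience} reader, addressing the critique and every "
            f"criterion scored below {target}.\nCritique: {critique}\nScores: {scores}\n\nDraft:\n{draft}"
        )
    return draft
```

Logging each round's scores is worth the extra line or two: it shows whether the loop is actually converging or just churning.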
When to Use Each Technique
| Situation | Best Technique |
|---|---|
| First draft of anything | Self-critique + revise |
| Important external document | Full rubric scoring + expert critique |
| Comparing options | A/B testing + audience perspective |
| Research or analysis | Fact-check + source verification |
| Ongoing content production | Quality rubric as standard check |
Related Reading
- ChatGPT Output Evaluation — Evaluate ChatGPT outputs for accuracy and reliability
- AI Evaluation Framework — Measure quality, risk, and ROI of AI implementations
- Prompt Patterns Guide — The core patterns that improve prompt effectiveness
Frequently Asked Questions
Can AI evaluate its own outputs?
Yes. Self-critique prompting is one of the most effective prompt engineering techniques. Ask AI to identify weaknesses, score against a rubric, critique from an expert perspective, and suggest improvements. This creates an iterative quality loop that significantly improves output quality.
How do you evaluate the quality of AI-generated content?
Use multiple techniques: self-critique (identify weaknesses), quality scoring rubrics (rate 1-5 on accuracy, completeness, clarity), A/B comparison (generate two versions and evaluate), audience testing (review from different perspectives), and hallucination checks (verify facts and sources).
How many rounds of revision does AI output need before it is publication-ready?
Most business content reaches publication quality in 3-4 rounds: (1) initial generation, (2) self-critique and revision, (3) expert perspective review, (4) final polish. High-stakes documents (board papers, client proposals) may need 5-6 rounds including human expert review.
