Prompt Engineering for Business · Guide

Prompting for Evaluation & Testing — Assess AI Output Quality

February 11, 2026 · 7 min read · Pertama Partners

How to use AI to evaluate and test its own outputs. Self-critique prompts, A/B testing, quality scoring, and systematic evaluation frameworks.


Using AI to Evaluate AI

One of the most powerful prompt engineering techniques is using AI to evaluate its own outputs. This creates a quality loop: generate output → evaluate it → improve it → evaluate again.

Self-Critique Prompting

Basic Self-Critique

After generating any output, follow up with:

Review what you just wrote. Identify:

  1. The 3 weakest points or claims
  2. Any statements that might be inaccurate
  3. Where the reasoning could be stronger
  4. What is missing that should be included

Then rewrite the output addressing these issues.
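The same critique-and-revise pattern can be scripted. Below is a minimal sketch assuming the OpenAI Python SDK; the model name and the example task are placeholders, and any chat-completion client would work the same way.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str) -> str:
    """Send a single-turn prompt and return the model's reply text."""
    reply = client.chat.completions.create(
        model="gpt-4o",  # placeholder: use whichever model you have access to
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content

# Step 1: generate the first draft (the task here is illustrative)
draft = ask("Write a one-page business case for automating invoice processing.")

# Step 2: self-critique, using the four questions from the prompt above
critique = ask(
    "Review the text below. Identify: 1) the 3 weakest points or claims, "
    "2) any statements that might be inaccurate, 3) where the reasoning could "
    "be stronger, 4) what is missing that should be included.\n\n" + draft
)

# Step 3: rewrite the draft so it addresses every issue raised
revised = ask(
    f"Rewrite the text so it addresses every issue in the critique.\n\n"
    f"TEXT:\n{draft}\n\nCRITIQUE:\n{critique}"
)
print(revised)
```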

Expert Critique

Now review this output as if you were a [specific expert]:

  • A sceptical CFO reviewing a business case
  • An employment lawyer reviewing an HR policy
  • A customer reading a sales proposal

What would they find unconvincing, unclear, or missing?

Red Team Analysis

Act as a critic who strongly disagrees with this analysis. What are the 5 strongest counter-arguments? Which claims are most vulnerable to challenge? Where is the evidence weakest?

Quality Scoring

Output Quality Rubric

Score this output on a 1-5 scale for each criterion:

  1. Accuracy — Are all facts and claims correct?
  2. Completeness — Does it address all aspects of the original request?
  3. Clarity — Is it easy to understand for the target audience?
  4. Actionability — Can the reader act on this immediately?
  5. Professionalism — Is the tone and format business-appropriate?

For each score below 4, explain what would need to change to earn a 5.
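One way to apply the rubric automatically is to ask for the scores as JSON so they can be logged or compared across drafts. A sketch, assuming the OpenAI Python SDK and its JSON response mode; the model name, file name, and JSON shape are illustrative.

```python
import json

from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "Score the text below on a 1-5 scale for accuracy, completeness, clarity, "
    "actionability and professionalism. For each score below 4, say what would "
    "need to change to earn a 5. Reply as JSON: "
    '{"scores": {"accuracy": 1-5, ...}, "fixes": {"accuracy": "...", ...}}'
    "\n\nTEXT:\n"
)

def score(text: str) -> dict:
    reply = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": RUBRIC + text}],
        response_format={"type": "json_object"},  # request strict JSON back
    )
    return json.loads(reply.choices[0].message.content)

result = score(open("proposal.txt").read())  # illustrative file name
print(result["scores"], result["fixes"], sep="\n")
```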

Comparative Quality Assessment

Compare these two versions of [document type]. For each criterion below, indicate which version is better and why:

  1. Clarity of main message
  2. Strength of supporting evidence
  3. Appropriateness of tone
  4. Logical structure
  5. Actionability of recommendations

Overall recommendation: which version to use and what improvements to make.

A/B Testing Prompts

Generate and Compare

Write two versions of this [email/proposal/report]:

  • Version A: formal, data-driven, conservative
  • Version B: conversational, story-driven, bold

Then evaluate both against these criteria: [list] and recommend which to use for [specific audience].
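The generate-and-compare pattern is easy to script as well. A sketch assuming the OpenAI Python SDK; the brief, the two styles, and the audience are illustrative placeholders.

```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    """Send a single-turn prompt and return the model's reply text."""
    reply = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content

brief = "Announce a new expense-approval policy to all staff."  # illustrative task

version_a = ask(f"Write this email in a formal, data-driven, conservative style: {brief}")
version_b = ask(f"Write this email in a conversational, story-driven, bold style: {brief}")

# Ask for a comparison against the criteria that matter for the target audience
verdict = ask(
    "Compare the two emails below on clarity, tone and likely response from "
    "frontline staff. Recommend which to send and what to improve.\n\n"
    f"VERSION A:\n{version_a}\n\nVERSION B:\n{version_b}"
)
print(verdict)
```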

Audience Testing

Read this [communication] from the perspective of each audience:

  1. A CEO (cares about: strategy, ROI, risk)
  2. A department manager (cares about: implementation, resources, timeline)
  3. A frontline employee (cares about: job impact, training, support)

For each perspective: what works well, what concerns would they have, and what changes would make it more effective for them.
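The persona review can be run as a simple loop, one call per audience. A sketch assuming the OpenAI Python SDK; the personas mirror the list above, and the file name is illustrative.

```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    """Send a single-turn prompt and return the model's reply text."""
    reply = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content

# Each persona maps to the concerns listed above
personas = {
    "CEO": "strategy, ROI, risk",
    "department manager": "implementation, resources, timeline",
    "frontline employee": "job impact, training, support",
}

draft = open("announcement.txt").read()  # the communication being tested

for persona, concerns in personas.items():
    feedback = ask(
        f"Read the text below as a {persona} who cares about {concerns}. "
        "What works well, what concerns would you have, and what changes "
        f"would make it more effective for you?\n\n{draft}"
    )
    print(f"--- {persona} ---\n{feedback}\n")
```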

Systematic Evaluation Frameworks

The GRADE Framework

Evaluate evidence quality in AI outputs:

  • G — Generalisability: Does this apply to our specific context (industry, country, company size)?
  • R — Recency: Is this based on current information or outdated data?
  • A — Accuracy: Can the key claims be verified?
  • D — Depth: Is the analysis superficial or thorough?
  • E — Evidence: Are sources cited and verifiable?

The CLEAR Framework

Evaluate communication quality:

  • C — Concise: Is every word necessary?
  • L — Logical: Does the argument flow?
  • E — Evidence-based: Are claims supported?
  • A — Actionable: Can the reader act on this?
  • R — Relevant: Is everything pertinent to the audience?

Testing for Hallucinations

Fact-Check Prompt

Review this output and identify any claims that might be fabricated or inaccurate. For each claim:

  1. Quote the specific text
  2. Assess confidence: definitely true / probably true / uncertain / probably false / definitely false
  3. Explain your reasoning
  4. Suggest how to verify

Source Verification

List every statistic, study, or source mentioned in this output. For each:

  1. Quote the reference
  2. Can this be verified through a real source?
  3. If uncertain, flag it with [NEEDS VERIFICATION]
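Both checks can be combined into a single verification pass. A sketch assuming the OpenAI Python SDK; the file name is illustrative, and the output is a starting point for human verification, not a substitute for it.

```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    """Send a single-turn prompt and return the model's reply text."""
    reply = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content

draft = open("market_analysis.txt").read()  # illustrative file name

report = ask(
    "List every statistic, study or source in the text below. For each one, "
    "quote it, rate confidence (definitely true / probably true / uncertain / "
    "probably false / definitely false), explain your reasoning, and mark "
    "anything uncertain with [NEEDS VERIFICATION].\n\n" + draft
)
print(report)  # a human still checks every flagged item against real sources
```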

Iterative Improvement Loop

The most effective evaluation process:

  1. Generate initial output
  2. Self-critique — ask AI to identify weaknesses
  3. Score — apply a quality rubric
  4. Revise — address identified issues
  5. Expert review — evaluate from target audience perspective
  6. Final polish — adjust tone, format, and emphasis

This loop typically produces publication-quality output in 3-4 rounds.
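Put together, the loop can be scripted end to end. A sketch assuming the OpenAI Python SDK; the stopping rule (every rubric score at least 4) and the four-round cap are illustrative choices, not fixed standards.

```python
import json

from openai import OpenAI

client = OpenAI()

def ask(prompt: str, **kwargs) -> str:
    """Send a single-turn prompt; extra keyword args go straight to the API call."""
    reply = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )
    return reply.choices[0].message.content

def rubric_scores(text: str) -> dict:
    """Score a draft 1-5 on the five rubric criteria, returned as a JSON object."""
    reply = ask(
        "Score the text from 1-5 on accuracy, completeness, clarity, "
        'actionability and professionalism. Reply as JSON: {"accuracy": n, ...}'
        "\n\nTEXT:\n" + text,
        response_format={"type": "json_object"},
    )
    return json.loads(reply)

# Round 0: initial generation (the task is illustrative)
draft = ask("Draft a one-page brief recommending a CRM upgrade.")

for round_number in range(1, 5):  # the 3-4 rounds mentioned above
    scores = rubric_scores(draft)
    if all(value >= 4 for value in scores.values()):
        break  # good enough to hand to a human reviewer
    critique = ask("Identify the weakest points and what is missing:\n\n" + draft)
    draft = ask(
        f"Rewrite the text to fix this critique.\n\nTEXT:\n{draft}\n\nCRITIQUE:\n{critique}"
    )

print(draft)
```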

When to Use Each Technique

Situation and best technique:

  • First draft of anything: self-critique + revise
  • Important external document: full rubric scoring + expert critique
  • Comparing options: A/B testing + audience perspective
  • Research or analysis: fact-check + source verification
  • Ongoing content production: quality rubric as standard check

Frequently Asked Questions

Can AI evaluate the quality of its own outputs?

Yes. Self-critique prompting is one of the most effective prompt engineering techniques. Ask AI to identify weaknesses, score against a rubric, critique from an expert perspective, and suggest improvements. This creates an iterative quality loop that significantly improves output quality.

How do I test the quality of AI-generated content?

Use multiple techniques: self-critique (identify weaknesses), quality scoring rubrics (rate 1-5 on accuracy, completeness, clarity), A/B comparison (generate two versions and evaluate), audience testing (review from different perspectives), and hallucination checks (verify facts and sources).

How many rounds of revision does AI output need?

Most business content reaches publication quality in 3-4 rounds: (1) initial generation, (2) self-critique and revision, (3) expert perspective review, (4) final polish. High-stakes documents (board papers, client proposals) may need 5-6 rounds including human expert review.

Ready to Apply These Insights to Your Organization?

Book a complimentary AI Readiness Audit to identify opportunities specific to your context.
