
One of the most powerful prompt engineering techniques is using AI to evaluate its own outputs. This creates a quality loop: generate output → evaluate it → improve it → evaluate again.
After generating any output, follow up with:
Review what you just wrote. Identify:
- The 3 weakest points or claims
- Any statements that might be inaccurate
- Where the reasoning could be stronger
- What is missing that should be included
Then rewrite the output addressing these issues.
Now review this output as if you were a [specific expert]:
- A sceptical CFO reviewing a business case
- An employment lawyer reviewing an HR policy
- A customer reading a sales proposal
What would they find unconvincing, unclear, or missing?
Act as a critic who strongly disagrees with this analysis. What are the 5 strongest counter-arguments? Which claims are most vulnerable to challenge? Where is the evidence weakest?
Score this output on a 1-5 scale for each criterion:
- Accuracy — Are all facts and claims correct?
- Completeness — Does it address all aspects of the original request?
- Clarity — Is it easy to understand for the target audience?
- Actionability — Can the reader act on this immediately?
- Professionalism — Is the tone and format business-appropriate?
For each score below 4, explain what would need to change to earn a 5.
Compare these two versions of [document type]. For each criterion below, indicate which version is better and why:
- Clarity of main message
- Strength of supporting evidence
- Appropriateness of tone
- Logical structure
- Actionability of recommendations
Overall recommendation: which version to use and what improvements to make.
Write two versions of this [email/proposal/report]:
- Version A: Formal, data-driven, conservative
- Version B: Conversational, story-driven, bold
Then evaluate both against these criteria: [list] and recommend which to use for [specific audience].
Read this [communication] from the perspective of each audience:
- A CEO (cares about: strategy, ROI, risk)
- A department manager (cares about: implementation, resources, timeline)
- A frontline employee (cares about: job impact, training, support)
For each perspective: what works well, what concerns would they have, and what changes would make it more effective for them.
Evaluate evidence quality in AI outputs with hallucination and source checks:
Review this output and identify any claims that might be fabricated or inaccurate. For each claim:
- Quote the specific text
- Assess confidence: definitely true / probably true / uncertain / probably false / definitely false
- Explain your reasoning
- Suggest how to verify
List every statistic, study, or source mentioned in this output. For each:
- Quote the reference
- Can this be verified through a real source?
- If uncertain, flag it with [NEEDS VERIFICATION]
The most effective evaluation process:
1. Generate the initial output
2. Self-critique and revise
3. Review from an expert perspective
4. Final polish
This loop typically produces publication-quality output in 3-4 rounds.
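The generate → evaluate → improve loop can be sketched as a simple driver. This is a minimal illustration, assuming a hypothetical `call_model(prompt)` function standing in for whatever LLM client you actually use; the critique wording is condensed from the self-critique prompt above.

```python
# Sketch of the generate -> critique -> revise loop.
# call_model is a hypothetical stand-in for any LLM client call.

CRITIQUE_PROMPT = (
    "Review what you just wrote. Identify the 3 weakest points, "
    "any statements that might be inaccurate, and what is missing. "
    "Then rewrite the output addressing these issues."
)

def quality_loop(task_prompt, call_model, rounds=3):
    """Run the self-critique loop for a fixed number of rounds."""
    draft = call_model(task_prompt)
    for _ in range(rounds):
        # Feed the current draft back with the critique instruction.
        draft = call_model(f"{draft}\n\n{CRITIQUE_PROMPT}")
    return draft
```

In practice you would stop early once a rubric score (see below in this guide) stops improving, rather than always running a fixed number of rounds.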
| Situation | Best Technique |
|---|---|
| First draft of anything | Self-critique + revise |
| Important external document | Full rubric scoring + expert critique |
| Comparing options | A/B testing + audience perspective |
| Research or analysis | Fact-check + source verification |
| Ongoing content production | Quality rubric as standard check |
The discipline of systematic prompt evaluation matured significantly between early 2024 and March 2026, transitioning from informal "looks good to me" assessments toward rigorous measurement methodologies borrowed from software testing and machine learning evaluation traditions.
LLM-as-Judge Methodology. Using one language model to evaluate another model's outputs became the dominant evaluation approach throughout 2025. OpenAI's research team published benchmarking data in July 2025 demonstrating that GPT-4o-based evaluation achieved 87% agreement with expert human evaluators for factual accuracy assessment, coherence scoring, and instruction-following compliance. Anthropic published similar findings for Claude-based evaluation in their September 2025 technical report, recommending structured evaluation rubrics with five-point scales across defined quality dimensions.
Automated Evaluation Pipelines. Production prompt engineering teams at organizations including Stripe, Shopify, Notion, and Canva implemented continuous evaluation infrastructure using frameworks that automatically test prompt modifications against curated test datasets before deployment.
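A minimal sketch of such an evaluation run, under simplifying assumptions: each test case pairs an input with a reference answer, and `score_fn` is a placeholder metric (exact match here; production pipelines typically swap in LLM-as-judge or rubric scoring).

```python
# Minimal evaluation harness: score a prompt function over a test set.

def exact_match(output, reference):
    """Trivial placeholder metric: 1.0 on exact match, else 0.0."""
    return 1.0 if output.strip() == reference.strip() else 0.0

def run_eval(prompt_fn, test_cases, score_fn=exact_match):
    """Return the mean score of prompt_fn over the test dataset."""
    scores = [score_fn(prompt_fn(case["input"]), case["expected"])
              for case in test_cases]
    return sum(scores) / len(scores)
```

Running this against every prompt modification gives the before/after numbers that the quality gates described below act on.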
Pertama Partners recommends organizations establish evaluation infrastructure following four sequential development phases:
Phase 1 — Golden Dataset Construction (Week 1-2). Curate between fifty and one hundred representative input-output pairs spanning the full range of expected user queries. Include edge cases, adversarial inputs, multilingual requests for Southeast Asian deployments covering English, Bahasa Indonesia, Thai, Vietnamese, and Bahasa Malaysia, and domain-specific technical terminology. Store datasets in version-controlled repositories using JSON Lines format compatible with evaluation frameworks.
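A golden dataset in JSON Lines format might look like the sketch below. The field names (`input`, `expected`, `tags`) are an illustrative assumption, not a standard schema; the `"..."` values are placeholders for real reference answers.

```python
import json

# Two illustrative golden-dataset entries, one line of JSON per test case.
GOLDEN_JSONL = """\
{"input": "Summarise this invoice dispute email", "expected": "...", "tags": ["edge_case"]}
{"input": "Terjemahkan ringkasan ini ke Bahasa Indonesia", "expected": "...", "tags": ["id"]}
"""

def load_golden_dataset(text):
    """Parse one test case per non-empty JSONL line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

Because each case is a single line of JSON, additions and edits show up cleanly in version-control diffs, which is the main reason JSONL is the conventional storage format here.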
Phase 2 — Metric Definition (Week 3). Establish measurable quality dimensions appropriate for each prompt category: factual accuracy scored through source document verification, completeness measured against required information element checklists, formatting compliance validated through structural pattern matching, tone consistency evaluated through LLM-as-judge assessment against calibration examples, and response latency benchmarked against user experience thresholds.
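One of the cheaper metrics above, formatting compliance via structural pattern matching, can be sketched as a regex check. The specific rules (one markdown heading, at least three bullets) are an illustrative assumption; real checks encode whatever structure the prompt demands.

```python
import re

# Formatting-compliance metric: does the response contain a markdown
# heading and at least three bullet points?

def formatting_compliance(text):
    """Return True when the response meets the required structure."""
    has_heading = bool(re.search(r"(?m)^#{1,3} ", text))
    bullet_count = len(re.findall(r"(?m)^- ", text))
    return has_heading and bullet_count >= 3
```

Checks like this run in microseconds, so they belong in the fast tier of the pipeline, ahead of slower LLM-as-judge dimensions such as tone consistency.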
Phase 3 — Baseline Establishment (Week 4). Execute the complete test suite against current production prompts to establish performance baselines across all defined metrics. Document results in evaluation dashboards built through Grafana, Streamlit, or dedicated platforms like Helicone and Langfuse providing real-time monitoring capabilities.
Phase 4 — Continuous Integration (Week 5+). Integrate evaluation execution into deployment pipelines using GitHub Actions, GitLab CI, or Jenkins workflows that automatically execute test suites when prompt templates are modified. Configure quality gate thresholds that prevent deployment of prompt changes producing metric regression exceeding five percent on any defined dimension.
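The five percent regression gate can be expressed as a small comparison function that a CI job calls after the evaluation run. This is a sketch under the assumption that metrics are mean scores in a dict keyed by dimension name.

```python
# Quality gate: block deployment if any metric regresses more than 5%
# relative to the recorded baseline.

REGRESSION_THRESHOLD = 0.05  # maximum allowed relative drop per metric

def passes_quality_gate(baseline, candidate, threshold=REGRESSION_THRESHOLD):
    """Return True if no metric drops more than `threshold` vs baseline."""
    for name, base_score in baseline.items():
        drop = (base_score - candidate.get(name, 0.0)) / base_score
        if drop > threshold:
            return False
    return True
```

In a GitHub Actions or GitLab CI workflow, the job simply exits non-zero when this returns False, which blocks the merge.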
When comparing two prompt variants, practitioners must account for the response variance inherent in non-deterministic model outputs. Running each variant against identical test inputs with temperature settings above zero produces different outputs across executions. Reliable comparison requires a minimum of thirty evaluation runs per variant to achieve statistical significance using the paired t-test or Wilcoxon signed-rank test methodologies recommended by Stanford HAI Laboratory publications.
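The paired comparison reduces to computing a t-statistic over per-input score differences. A standard-library sketch of that statistic is below; a full test would also look up the p-value, which `scipy.stats.ttest_rel` does in one call when SciPy is available.

```python
from math import sqrt
from statistics import mean, stdev

# Paired t-statistic over per-input score differences between two
# prompt variants: t = mean(d) / (stdev(d) / sqrt(n)).

def paired_t_statistic(scores_a, scores_b):
    """Compute the paired t-statistic for matched evaluation runs."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n))
```

Pairing matters because each difference is taken on the same test input, which cancels per-input difficulty and needs far fewer runs than an unpaired comparison would.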
Yes. Self-critique prompting is one of the most effective prompt engineering techniques. Ask AI to identify weaknesses, score against a rubric, critique from an expert perspective, and suggest improvements. This creates an iterative quality loop that significantly improves output quality.
Use multiple techniques: self-critique (identify weaknesses), quality scoring rubrics (rate 1-5 on accuracy, completeness, clarity), A/B comparison (generate two versions and evaluate), audience testing (review from different perspectives), and hallucination checks (verify facts and sources).
Most business content reaches publication quality in 3-4 rounds: (1) initial generation, (2) self-critique and revision, (3) expert perspective review, (4) final polish. High-stakes documents (board papers, client proposals) may need 5-6 rounds including human expert review.