AI Training & Capability Building · Guide · Advanced

Designing AI Competency Tests: Creating Valid & Reliable Assessments

October 27, 2025 · 18 min read · Pertama Partners
For: Chief Learning Officer, L&D Director, Training Manager, HR Director, HR Leader

Build AI skills tests that actually measure capability with validated question design, scoring rubrics, and psychometric quality controls.


Key Takeaways

  1. Validity is non-negotiable: AI tests must be clearly linked to job tasks and proven to predict real performance.
  2. Combine scenario-based knowledge items with authentic performance tasks to measure applied AI capability.
  3. Use behaviorally anchored rating scales to improve inter-rater reliability and reduce scorer bias.
  4. Run pilots with item analysis, reliability checks, and bias reviews before using assessments for high-stakes decisions.
  5. Set cut scores using structured methods like Angoff or contrasting groups, not arbitrary percentages.
  6. Document your validation and compliance evidence to withstand scrutiny from leadership, legal, and regulators.

Executive Summary

Poorly designed AI competency tests create false confidence: employees pass but can't perform. This guide provides evidence-based principles for designing AI assessments with high validity (measures what it claims to measure) and reliability (consistent results across administrations). Learn how to craft knowledge questions, performance tasks, and scoring systems that accurately predict real-world AI capability.

What you'll learn:

  • Test validity principles: ensuring assessments measure actual AI competency
  • Reliability techniques: reducing scorer bias and test-retest variation
  • Question design frameworks for knowledge, application, and synthesis
  • Performance task construction for authentic AI skill measurement
  • Psychometric validation methods to prove assessment quality

Expected outcome: AI competency tests that reliably identify who can use AI effectively, predict job performance, and withstand scrutiny from leadership, legal, and external auditors.


The Cost of Invalid Assessments

What happens when AI competency tests are poorly designed:

Scenario 1: False Positives

  • Employee passes AI competency test (90% score)
  • Manager assigns them to draft client proposals using AI
  • Outputs are unusable—require complete rewrite
  • Client relationship damaged by poor quality

Root cause: Test measured trivia ("What is a token?") not capability ("Draft a proposal using effective prompts").


Scenario 2: False Negatives

  • Experienced employee fails AI test (65% score)
  • Excluded from AI pilot program
  • They were already using AI effectively in their role
  • Organization loses a potential AI champion

Root cause: Test used obscure technical jargon not relevant to job tasks.


Scenario 3: Legal Liability

  • AI competency test used for promotion decisions
  • Disproportionately fails older workers
  • Discrimination lawsuit filed
  • No evidence test predicts job performance

Root cause: No validation study proving test relevance to role requirements.


The fix: Design assessments using validated psychometric principles.

Test Validity: Does It Measure What It Claims?

Validity is the most important quality of any assessment. An AI competency test is valid if scores correlate with actual AI performance on the job.

Types of Validity

1. Content Validity

Definition: Test content represents the domain of AI skills required for the role.

How to establish:

  • Map test items to job task analysis
  • Subject matter expert (SME) review panel confirms relevance
  • Coverage matrix ensures all critical skills are assessed

Example: For a marketing role using AI:

  • ✅ "Use AI to draft 3 social media posts from this blog article" (job-relevant)
  • ❌ "Explain the architecture of a transformer model" (not job-relevant)

2. Construct Validity

Definition: Test measures the theoretical construct of "AI competency" as defined.

How to establish:

  • Factor analysis shows items cluster into expected dimensions (e.g., prompt engineering, output evaluation, ethical judgment)
  • Scores correlate with related constructs (tech fluency, problem-solving ability)
  • Scores don't correlate with unrelated constructs (age, tenure)
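To illustrate the factor-analysis check above, here is a minimal Python sketch using scikit-learn's FactorAnalysis; the item_scores matrix and the three-dimension assumption are placeholders you would replace with your own pilot data and competency model.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# item_scores: rows = test-takers, columns = items (placeholder data).
rng = np.random.default_rng(0)
item_scores = rng.integers(0, 6, size=(200, 12)).astype(float)

# Assume the competency model has 3 dimensions (e.g., prompting,
# output evaluation, ethical judgment).
fa = FactorAnalysis(n_components=3, random_state=0)
fa.fit(item_scores)

# Loadings: one row per factor, one column per item. Items written for the
# same dimension should load strongly on the same factor.
loadings = fa.components_
for i, factor in enumerate(loadings, start=1):
    top_items = np.argsort(np.abs(factor))[::-1][:4]
    print(f"Factor {i}: strongest-loading items -> {top_items.tolist()}")
```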

3. Criterion Validity

Definition: Test scores predict external outcomes (job performance, productivity, manager ratings).

How to establish:

  • Concurrent validity: Employees who score high also demonstrate high AI proficiency in current work
  • Predictive validity: New hires who score high become proficient AI users faster

Validation study example:

  • Administer AI competency test to 100 employees
  • Collect manager ratings of AI proficiency 3 months later
  • Calculate correlation (target: r > 0.50)
  • If correlation is strong, test has predictive validity
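As a rough illustration of this validation study, the sketch below computes the correlation with SciPy; test_scores and manager_ratings are placeholder arrays standing in for your own data.

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholder data: percent-correct test scores and 1-5 manager ratings
# collected ~3 months later for the same employees.
test_scores = np.array([72, 85, 64, 90, 78, 55, 81, 69, 88, 74])
manager_ratings = np.array([3, 4, 3, 5, 4, 2, 4, 3, 5, 3])

r, p_value = pearsonr(test_scores, manager_ratings)
print(f"r = {r:.2f}, p = {p_value:.3f}")

# Rough read-out against the targets discussed in this guide
if r >= 0.50:
    print("Strong evidence of criterion validity")
elif r >= 0.40:
    print("Moderate evidence -- acceptable, keep monitoring")
else:
    print("Weak evidence -- revisit item content and job relevance")
```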

Test Reliability: Consistency Across Administrations

Reliability means the test produces consistent results. An unreliable test is useless—scores fluctuate randomly.

Types of Reliability

1. Test-Retest Reliability

Definition: Same person gets similar scores when taking the test twice (with time gap).

How to measure:

  • Give test to 30 people
  • Re-administer 2 weeks later
  • Calculate correlation between Time 1 and Time 2 scores
  • Target: r > 0.80

Common issues:

  • Too easy: Everyone scores 90%+ on both attempts (ceiling effect)
  • Memory effect: People remember questions from first attempt
  • Learning effect: People improved their skills between tests

Fix: Use parallel forms (two versions of same test with different questions)


2. Inter-Rater Reliability

Definition: Different scorers assign similar scores to the same performance.

How to measure:

  • Have 2 raters independently score the same 20 submissions
  • Calculate agreement percentage or Cohen's kappa
  • Target: >85% exact agreement or kappa > 0.75

Common issues:

  • Vague rubrics: "Good prompt quality" (subjective)
  • Halo effect: Rater's overall impression influences all scores
  • Leniency/severity: Some raters consistently score higher/lower

Fix: Use behaviorally anchored rating scales (BARS) with specific examples
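For teams that want to automate the check described above, here is a minimal Python sketch of exact agreement and Cohen's kappa using scikit-learn; the two rater score lists are placeholders, and the weighted kappa line is an optional variant for ordinal rubric scores.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Placeholder scores: two raters independently scored the same 20 submissions
# on a 0-5 rubric dimension.
rater_a = np.array([5, 4, 3, 4, 2, 5, 3, 4, 4, 3, 5, 2, 4, 3, 4, 5, 3, 4, 2, 4])
rater_b = np.array([5, 4, 3, 3, 2, 5, 3, 4, 5, 3, 5, 2, 4, 3, 4, 4, 3, 4, 2, 4])

exact_agreement = np.mean(rater_a == rater_b)
kappa = cohen_kappa_score(rater_a, rater_b)
# Optional: weighted kappa penalizes near-misses less, useful for ordinal rubric scores
weighted_kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")

print(f"Exact agreement: {exact_agreement:.0%}")
print(f"Cohen's kappa: {kappa:.2f} (quadratic-weighted: {weighted_kappa:.2f})")
# Targets from above: >85% exact agreement, kappa > 0.75
```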


3. Internal Consistency

Definition: Test items measure the same underlying construct.

How to measure:

  • Calculate Cronbach's alpha (statistical measure)
  • Target: α > 0.70 for competency tests

Common issues:

  • Heterogeneous items: Test mixes unrelated skills (prompt writing + ethical reasoning + data analysis)
  • Too few items: <10 questions makes alpha unstable

Fix: Use subscales for different competency dimensions
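Cronbach's alpha is straightforward to compute directly; the sketch below implements the standard formula in NumPy on a placeholder item-response matrix (rows are test-takers, columns are items).

```python
import numpy as np

def cronbach_alpha(responses: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    responses = np.asarray(responses, dtype=float)
    k = responses.shape[1]
    item_variances = responses.var(axis=0, ddof=1)
    total_variance = responses.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Placeholder pilot data: 40 test-takers x 20 items, 1 = correct, 0 = incorrect.
rng = np.random.default_rng(1)
responses = (rng.random((40, 20)) > 0.4).astype(int)

print(f"Cronbach's alpha: {cronbach_alpha(responses):.2f}  (target > 0.70)")
```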


Question Design Framework

Bloom's Taxonomy for AI Assessments

Align questions with cognitive levels:

Level | Definition | AI Example Question
Remember | Recall facts | "What is a hallucination in AI?"
Understand | Explain concepts | "Why might AI outputs contain bias?"
Apply | Use knowledge in new situations | "Use AI to summarize this meeting transcript"
Analyze | Break down information | "Compare these 3 AI-generated summaries—which is most accurate?"
Evaluate | Make judgments | "Should we use this AI output for this task? Why/why not?"
Create | Produce new work | "Design an AI workflow for monthly reporting"

Assessment design principle:

  • Literacy tests: Focus on Remember + Understand (Levels 1-2)
  • Fluency tests: Focus on Apply + Analyze (Levels 3-4)
  • Mastery tests: Focus on Evaluate + Create (Levels 5-6)

Writing Effective Multiple-Choice Questions

Bad example:

Q: ChatGPT was created by:
A) Google
B) Meta
C) OpenAI ✓
D) Microsoft

Why it's bad: Tests trivia, not competency. Googleable.


Better example:

Q: You're using AI to draft a performance review. The output is factually accurate but sounds overly harsh. What should you do?
A) Send it as-is—AI is objective
B) Refine the prompt to request a more constructive tone ✓
C) Manually rewrite the entire review
D) Abandon AI for this task

Why it's better: Tests judgment and application, not memorization.


Question Design Checklist

Stems (question portion):

  • Complete sentence that poses a clear problem
  • No negative phrasing ("Which is NOT...") unless necessary
  • Sufficient context to answer without guessing

Options (answer choices):

  • One clearly correct answer
  • 3-4 plausible distractors (wrong but believable)
  • Similar length across options
  • No "all of the above" or "none of the above"

Distractor quality:

  • Represent common misconceptions
  • Not obviously wrong
  • Grammatically parallel

Performance Task Design

Authenticity Criteria

Performance tasks should mirror real work. Use the RACE framework:

R - Realistic: Matches actual job tasks
A - Ambiguous: Requires judgment (no single "right" answer)
C - Constrained: Time limit + resource limit (simulates work pressure)
E - Evaluable: Clear scoring criteria


Sample Performance Task: Email Response

Scenario: You received this customer complaint:

"I ordered your product 2 weeks ago and it still hasn't arrived. Your tracking system says 'in transit' but hasn't updated in 5 days. This is unacceptable. I need this for an event on Friday. Either get it here by Thursday or issue a full refund immediately."

Task: Use AI (ChatGPT, Claude, etc.) to draft a response email that:

  1. Acknowledges the customer's frustration
  2. Explains the situation (you'll provide: "Shipment delayed due to weather, now arriving Monday")
  3. Offers a solution (you can offer: overnight shipping for next order, 15% refund, or full refund + cancellation)
  4. Maintains professional, empathetic tone

Time limit: 10 minutes

Deliverables:

  1. Your prompt(s) to the AI
  2. The AI's output
  3. Your final edited email (ready to send)

Scoring rubric: (See next section)


Scoring Rubric Design

Use Behaviorally Anchored Rating Scales (BARS) to reduce subjectivity:

Dimension 1: Prompt Quality (0-5)

Score | Descriptor | Behavioral Anchor
5 | Expert | Prompt includes: customer issue, company policy context, tone requirement, specific facts. AI output needs zero editing.
4 | Proficient | Prompt includes most context. AI output needs minor edits (1-2 sentences).
3 | Developing | Prompt missing key context (e.g., tone requirement). AI output needs moderate editing (3-5 sentences).
2 | Struggling | Prompt vague. AI output requires major rewrite (>50% of content).
1 | Insufficient | Prompt minimal ("Write an apology email"). AI output unusable.

Dimension 2: Output Evaluation (0-5)

Score | Descriptor | Behavioral Anchor
5 | Expert | Correctly identified AI output issues (if any) and made appropriate edits. Final email is professional, accurate, empathetic.
4 | Proficient | Made most necessary edits. Final email is good with 1-2 minor issues.
3 | Developing | Missed some AI errors or made unnecessary edits. Final email is acceptable but not polished.
2 | Struggling | Didn't catch significant AI errors. Final email has factual errors or tone problems.
1 | Insufficient | Sent AI output with minimal review. Final email unprofessional or inaccurate.

Dimension 3: Efficiency (0-5)

Score | Descriptor | Behavioral Anchor
5 | Expert | Completed in <7 minutes with 1-2 prompt iterations. Efficient workflow.
4 | Proficient | Completed in 7-9 minutes with 2-3 iterations. Reasonable efficiency.
3 | Developing | Completed in 9-10 minutes with 4-5 iterations. Some wasted effort.
2 | Struggling | Barely completed within 10 minutes or went over time. >5 iterations. Inefficient.
1 | Insufficient | Did not complete task within the time limit. No viable output.

Total score: Sum of 3 dimensions (max 15 points)

Pass threshold: ≥11 points (73%)


Psychometric Validation Process

Before deploying an AI competency test organization-wide, validate it:

Step 1: Pilot Test (n=30-50)

Objectives:

  • Test clarity: Do people understand questions?
  • Time adequacy: Can they finish in allotted time?
  • Difficulty distribution: Not too easy/hard
  • Technical issues: Platform glitches

Data to collect:

  • Completion rate
  • Time spent per question
  • Item difficulty (% getting each question correct)
  • Open feedback ("What was confusing?")

Red flags:

  • >10% don't finish (test is too long)
  • Any question with <20% or >95% correct (too hard/easy)
  • Consistent complaints about unclear instructions

Step 2: Item Analysis

For each question, calculate:

Difficulty (p-value):

  • Formula: (# who answered correctly) / (# who attempted)
  • Target range: 0.30 - 0.90
  • Too easy (>0.90): Doesn't differentiate skill levels
  • Too hard (<0.30): May be poorly written or off-topic

Discrimination (point-biserial correlation):

  • Measures whether high performers on overall test also answer this question correctly
  • Target: r > 0.20
  • Low discrimination (<0.15): Question might be flawed or irrelevant

Decision rules:

  • Keep items with p between 0.30-0.90 AND discrimination >0.20
  • Revise items outside these ranges
  • Delete items that can't be fixed
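A minimal item-analysis sketch in Python, assuming a 0/1 item-response matrix from the pilot; the placeholder data and thresholds simply mirror the decision rules above.

```python
import numpy as np
from scipy.stats import pointbiserialr

# Placeholder pilot data: 50 test-takers x 20 items, 1 = correct, 0 = incorrect.
rng = np.random.default_rng(2)
responses = (rng.random((50, 20)) > 0.35).astype(int)

for item in range(responses.shape[1]):
    item_scores = responses[:, item]
    # Exclude the item itself from the total to avoid inflating the correlation
    rest_score = responses.sum(axis=1) - item_scores

    difficulty = item_scores.mean()                      # proportion answering correctly
    discrimination, _ = pointbiserialr(item_scores, rest_score)

    keep = 0.30 <= difficulty <= 0.90 and discrimination > 0.20
    verdict = "keep" if keep else "revise or delete"
    print(f"Item {item + 1:>2}: p = {difficulty:.2f}, r_pb = {discrimination:.2f} -> {verdict}")
```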

Step 3: Reliability Analysis

Internal consistency (Cronbach's alpha):

  • Calculate for overall test
  • Target: α > 0.70
  • If too low: Remove poor items or add more items

Inter-rater reliability (for performance tasks):

  • Have 2 raters score 20 submissions independently
  • Calculate percent exact agreement and Cohen's kappa
  • Target: >85% agreement, kappa >0.75
  • If too low: Revise scoring rubric for clarity

Step 4: Validity Study

Criterion validity:

  • Correlate test scores with external measure of AI proficiency:
    • Manager ratings
    • Peer nominations ("Who uses AI most effectively?")
    • Objective usage data (# of AI sessions, time savings)
  • Target correlation: r > 0.40 (moderate), ideally >0.50 (strong)

Example validation:

  • Test 100 employees
  • Ask managers: "Rate this person's AI proficiency on 1-5 scale"
  • Calculate correlation between test scores and manager ratings
  • If r = 0.55, test has good criterion validity

Step 5: Bias Analysis

Differential item functioning (DIF):
Check if questions systematically favor certain groups

Process:

  • Compare performance by demographic group (age, gender, tenure, etc.)
  • Identify items where groups differ significantly after controlling for overall ability
  • Revise or remove biased items

Example:

  • Item: "Use AI to optimize your TikTok marketing strategy"
  • Older workers score lower even when they have the same overall AI competency
  • Diagnosis: Biased—assumes familiarity with TikTok
  • Fix: Use neutral platform ("social media platform of your choice")
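One common way to operationalize a DIF check is logistic regression: predict item success from the total score plus a group indicator, and inspect the group coefficient. The sketch below uses statsmodels on placeholder data; the variable names and data-generation step are illustrative only.

```python
import numpy as np
import statsmodels.api as sm

# Placeholder data: 200 test-takers with a total score, a 0/1 group indicator,
# and a 0/1 response to the item being checked.
rng = np.random.default_rng(3)
n = 200
total_score = rng.normal(70, 10, n)
group = rng.integers(0, 2, n)
item_correct = (rng.random(n) < 1 / (1 + np.exp(-(total_score - 70) / 10))).astype(int)

# Predict item success from overall ability plus group membership.
X = sm.add_constant(np.column_stack([total_score, group]))
model = sm.Logit(item_correct, X).fit(disp=False)

group_coef, group_p = model.params[2], model.pvalues[2]
print(f"Group coefficient: {group_coef:.2f} (p = {group_p:.3f})")
# A significant group coefficient after controlling for total score suggests
# uniform DIF: review the item for group-specific content.
```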

Standard Setting: Defining "Passing"

How do you determine the cut score (minimum passing score)?

Method 1: Angoff Standard Setting

Process:

  1. Assemble panel of 5-8 subject matter experts (SMEs)
  2. Define "minimally competent" AI user
  3. For each question, SMEs estimate: "What % of minimally competent users would answer this correctly?"
  4. Average estimates across SMEs and questions
  5. Result = cut score

Example:

  • 20 questions on test
  • SMEs estimate minimally competent user would get: 60%, 70%, 80%, 65%, 75%... (for each question)
  • Average across questions: 72%
  • Cut score: 72%
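The arithmetic is simple to automate; here is a sketch with placeholder SME estimates (three panelists, ten questions) that averages the ratings exactly as described above.

```python
import numpy as np

# Placeholder estimates: rows = SMEs, columns = questions. Each cell is the
# estimated probability that a minimally competent AI user answers correctly.
angoff_estimates = np.array([
    [0.60, 0.70, 0.80, 0.65, 0.75, 0.70, 0.55, 0.80, 0.75, 0.70],  # SME 1
    [0.65, 0.75, 0.85, 0.60, 0.70, 0.75, 0.60, 0.75, 0.80, 0.65],  # SME 2
    [0.55, 0.70, 0.80, 0.70, 0.80, 0.70, 0.65, 0.85, 0.70, 0.75],  # SME 3
])

cut_score = angoff_estimates.mean()  # average across SMEs and questions
print(f"Angoff cut score: {cut_score:.0%}")
```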

Method 2: Contrasting Groups

Process:

  1. Identify two groups:
    • Competent: Known to use AI effectively (manager/peer confirmed)
    • Not competent: Known to struggle with AI
  2. Administer test to both groups
  3. Find score that best separates groups (maximizes hits, minimizes false positives/negatives)

Example:

  • Competent group: Mean score = 82%, SD = 8
  • Not competent group: Mean score = 58%, SD = 12
  • Optimal cut score (minimizes misclassification): 70%
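A simple way to find that optimal cut score is to scan candidate values and count misclassifications in each group; the sketch below does this on placeholder score distributions matching the example above.

```python
import numpy as np

# Placeholder score distributions matching the example above.
rng = np.random.default_rng(4)
competent = rng.normal(82, 8, 40)        # known-effective AI users
not_competent = rng.normal(58, 12, 40)   # known to struggle with AI

best_cut, fewest_errors = None, np.inf
for cut in range(40, 100):
    false_negatives = np.sum(competent < cut)        # competent people who would fail
    false_positives = np.sum(not_competent >= cut)   # struggling people who would pass
    errors = false_negatives + false_positives
    if errors < fewest_errors:
        best_cut, fewest_errors = cut, errors

print(f"Optimal cut score: {best_cut}% ({fewest_errors} misclassifications out of 80)")
```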

Method 3: Normative (Percentile-Based)

Process:

  1. Administer test to representative sample
  2. Set cut score at desired percentile (e.g., 70th percentile)

When to use: When you need to credential top performers (e.g., "AI Champions must score in top 20%")

When NOT to use: When you need to ensure minimum competency for safety/compliance


Legal & Compliance Considerations

EEOC Guidelines (US)

If an AI competency test is used for hiring, promotion, or other employment decisions:

Requirements:

  1. Job relatedness: Test must measure skills required for job
  2. Business necessity: Must prove test predicts job performance
  3. Adverse impact analysis: Check if test disproportionately screens out protected groups
  4. Validation evidence: Maintain documentation of validity studies

Red flag: The test screens out 80% of one demographic group but only 50% of another. That leaves pass rates of 20% vs. 50%, an impact ratio of 0.40, well below the four-fifths (0.80) threshold.

Fix: Conduct DIF analysis, remove biased items, or prove test predicts job performance for all groups
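A quick four-fifths check can be scripted as below; the pass counts are placeholders, and each group's ratio is computed against the group with the highest pass rate.

```python
# Placeholder pass counts by demographic group.
group_pass_rates = {
    "Group A": 50 / 100,   # 50 of 100 pass
    "Group B": 20 / 100,   # 20 of 100 pass
}

highest_rate = max(group_pass_rates.values())
for group, rate in group_pass_rates.items():
    impact_ratio = rate / highest_rate
    status = "OK" if impact_ratio >= 0.80 else "Potential adverse impact"
    print(f"{group}: pass rate {rate:.0%}, impact ratio {impact_ratio:.2f} -> {status}")
```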


GDPR Considerations (EU)

If assessing EU-based employees:

Requirements:

  1. Data minimization: Only collect scores needed for decision
  2. Transparency: Inform test-takers how scores will be used
  3. Right to explanation: Employees can request explanation of scoring
  4. Automated decision-making: If the test auto-fails candidates, human review is required

Key Takeaways

  1. Validity is paramount: A test that doesn't measure real AI competency is worse than no test—it creates false confidence.
  2. Performance tasks are essential for fluency/mastery assessment—knowledge questions alone can't predict applied skill.
  3. Behaviorally anchored rubrics reduce scorer bias and improve inter-rater reliability.
  4. Pilot and validate before scaling: Item analysis, reliability checks, and validity studies prevent costly mistakes.
  5. Standard setting should be evidence-based: Use Angoff or contrasting groups methods, not arbitrary percentages.
  6. Legal compliance requires documentation: Maintain validity studies and bias analyses if using assessments for employment decisions.

Next Steps

This week:

  1. Define the AI competency construct for your organization (what skills matter for each role?)
  2. Draft 5 multiple-choice questions using scenario-based format
  3. Design 1 performance task with BARS rubric

This month:

  1. Pilot test with 30-50 employees
  2. Conduct item analysis (difficulty, discrimination)
  3. Calculate Cronbach's alpha and inter-rater reliability

This quarter:

  1. Conduct criterion validity study (correlate scores with manager ratings or usage data)
  2. Perform bias analysis (DIF by demographic group)
  3. Use Angoff method to set defensible cut score

Partner with Pertama Partners to design, validate, and defend AI competency assessments that meet psychometric and legal standards while accurately measuring real capability.

Frequently Asked Questions

How is AI competency different from AI literacy?
AI literacy focuses on basic understanding of concepts and terminology, while AI competency measures the ability to apply AI tools effectively in real work. Literacy can be assessed with knowledge questions; competency requires scenario-based items and performance tasks aligned to job tasks.

How many questions does an AI competency test need?
For most workplace competency tests, 25–40 well-designed items plus 1–3 performance tasks are enough to achieve acceptable reliability (Cronbach’s alpha > 0.70). Fewer items can work if they are tightly focused and supported by robust scoring rubrics.

Do we need a psychometrician to build the assessment?
You don’t strictly need a psychometrician for small internal pilots, but for high-stakes uses (hiring, promotion, certification) you should involve someone with psychometric expertise to run item analysis, reliability, validity, and bias checks and to document the evidence.

How often should the assessment be reviewed and updated?
Review at least annually, or sooner if there are major changes in AI tools or workflows. Use item performance data, SME review, and feedback from test-takers to retire outdated items, add new ones, and revalidate the assessment.

Can we use generic off-the-shelf AI quizzes for certification?
Generic quizzes rarely align with your job tasks and usually lack validation evidence. They can create false positives and legal risk if tied to employment decisions. For certification, design role-specific assessments and validate them against real performance.

Don’t Confuse Trivia with Competency

If your AI test can be passed by memorizing definitions or searching the web, it will not predict on-the-job performance. Prioritize scenario-based questions and performance tasks that mirror real decisions and workflows.

Start Small, Then Harden the Assessment

Begin with a pilot focused on one role or business unit. Use pilot data to refine items, rubrics, and cut scores before scaling the assessment across the organization.

Use Subscales to Target Development

Break your AI competency model into subscales such as prompt engineering, critical evaluation of outputs, workflow design, and ethics. Reporting scores by subscale gives managers clearer development guidance than a single overall score.

0.70: Minimum recommended Cronbach’s alpha for AI competency tests (Source: psychometric best-practice guidelines)

0.50: Target correlation between test scores and manager ratings for strong predictive validity (Source: applied industrial-organizational psychology practice)

"A smaller, well-validated AI assessment is more defensible and predictive than a long, unvalidated test full of trivia."

Pertama Partners – AI Assessment Practice

"Performance tasks plus behaviorally anchored rubrics are the single most powerful lever for making AI competency assessments both fair and predictive."

Pertama Partners – AI Assessment Practice

References

  1. Standards for Educational and Psychological Testing. American Educational Research Association, American Psychological Association, National Council on Measurement in Education (2014)
  2. Uniform Guidelines on Employee Selection Procedures. U.S. Equal Employment Opportunity Commission (1978)
  3. General Data Protection Regulation (GDPR). European Union (2016)

Ready to Apply These Insights to Your Organization?

Book a complimentary AI Readiness Audit to identify opportunities specific to your context.
