Executive Summary
Poorly designed AI competency tests create false confidence: employees pass but can't perform. This guide provides evidence-based principles for designing AI assessments with high validity (measures what it claims to measure) and reliability (consistent results across administrations). Learn how to craft knowledge questions, performance tasks, and scoring systems that accurately predict real-world AI capability.
What you'll learn:
- Test validity principles: ensuring assessments measure actual AI competency
- Reliability techniques: reducing scorer bias and test-retest variation
- Question design frameworks for knowledge, application, and synthesis
- Performance task construction for authentic AI skill measurement
- Psychometric validation methods to prove assessment quality
Expected outcome: AI competency tests that reliably identify who can use AI effectively, predict job performance, and withstand scrutiny from leadership, legal, and external auditors.
The Cost of Invalid Assessments
What happens when AI competency tests are poorly designed:
Scenario 1: False Positives
- Employee passes AI competency test (90% score)
- Manager assigns them to draft client proposals using AI
- Outputs are unusable—require complete rewrite
- Client relationship damaged by poor quality
Root cause: Test measured trivia ("What is a token?") not capability ("Draft a proposal using effective prompts").
Scenario 2: False Negatives
- Experienced employee fails AI test (65% score)
- Excluded from AI pilot program
- They were already using AI effectively in their role
- Organization loses a potential AI champion
Root cause: Test used obscure technical jargon not relevant to job tasks.
Scenario 3: Legal Liability
- AI competency test used for promotion decisions
- Disproportionately fails older workers
- Discrimination lawsuit filed
- No evidence test predicts job performance
Root cause: No validation study proving test relevance to role requirements.
The fix: Design assessments using validated psychometric principles.
Test Validity: Does It Measure What It Claims?
Validity is the most important quality of any assessment. An AI competency test is valid if scores correlate with actual AI performance on the job.
Types of Validity
1. Content Validity
Definition: Test content represents the domain of AI skills required for the role.
How to establish:
- Map test items to job task analysis
- Subject matter expert (SME) review panel confirms relevance
- Coverage matrix ensures all critical skills are assessed
Example: For a marketing role using AI:
- ✅ "Use AI to draft 3 social media posts from this blog article" (job-relevant)
- ❌ "Explain the architecture of a transformer model" (not job-relevant)
2. Construct Validity
Definition: Test measures the theoretical construct of "AI competency" as defined.
How to establish:
- Factor analysis shows items cluster into expected dimensions (e.g., prompt engineering, output evaluation, ethical judgment)
- Scores correlate with related constructs (tech fluency, problem-solving ability)
- Scores don't correlate with unrelated constructs (age, tenure)
3. Criterion Validity
Definition: Test scores predict external outcomes (job performance, productivity, manager ratings).
How to establish:
- Concurrent validity: Employees who score high also demonstrate high AI proficiency in current work
- Predictive validity: New hires who score high become proficient AI users faster
Validation study example:
- Administer AI competency test to 100 employees
- Collect manager ratings of AI proficiency 3 months later
- Calculate correlation (target: r > 0.50)
- If correlation is strong, test has predictive validity
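To make the correlation step concrete, here is a minimal sketch in Python (assuming NumPy and SciPy are available); the scores and ratings below are purely illustrative, not real study data.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical matched data: one test score and one later manager rating per employee.
test_scores = np.array([78, 92, 65, 88, 71, 95, 60, 84, 79, 90])
manager_ratings = np.array([3.5, 4.5, 2.5, 4.0, 3.0, 5.0, 2.0, 4.0, 3.5, 4.5])

r, p_value = pearsonr(test_scores, manager_ratings)
print(f"Criterion validity: r = {r:.2f} (p = {p_value:.3f})")
print("Meets r > 0.50 target" if r > 0.50 else "Below r > 0.50 target")
```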
Test Reliability: Consistency Across Administrations
Reliability means the test produces consistent results. An unreliable test is useless—scores fluctuate randomly.
Types of Reliability
1. Test-Retest Reliability
Definition: Same person gets similar scores when taking the test twice (with time gap).
How to measure:
- Give test to 30 people
- Re-administer 2 weeks later
- Calculate correlation between Time 1 and Time 2 scores
- Target: r > 0.80
Common issues:
- Too easy: Everyone scores 90%+ on both attempts (ceiling effect)
- Memory effect: People remember questions from first attempt
- Learning effect: People improved their skills between tests
Fix: Use parallel forms (two versions of same test with different questions)
2. Inter-Rater Reliability
Definition: Different scorers assign similar scores to the same performance.
How to measure:
- Have 2 raters independently score the same 20 submissions
- Calculate agreement percentage or Cohen's kappa
- Target: >85% exact agreement or kappa > 0.75
Common issues:
- Vague rubrics: "Good prompt quality" (subjective)
- Halo effect: Rater's overall impression influences all scores
- Leniency/severity: Some raters consistently score higher/lower
Fix: Use behaviorally anchored rating scales (BARS) with specific examples
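A minimal sketch of the agreement and kappa calculation described above, assuming scikit-learn is available and two raters have scored the same submissions on the shared 0-5 scale; the scores are illustrative.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical scores from two raters for the same 20 submissions.
rater_a = np.array([5, 4, 3, 4, 2, 5, 3, 4, 4, 1, 5, 3, 2, 4, 5, 3, 4, 2, 3, 4])
rater_b = np.array([5, 4, 3, 3, 2, 5, 3, 4, 4, 1, 5, 3, 2, 4, 4, 3, 4, 2, 3, 4])

exact_agreement = np.mean(rater_a == rater_b)   # proportion of identical scores
kappa = cohen_kappa_score(rater_a, rater_b)     # agreement corrected for chance
print(f"Exact agreement: {exact_agreement:.0%}, Cohen's kappa: {kappa:.2f}")
```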
3. Internal Consistency
Definition: Test items measure the same underlying construct.
How to measure:
- Calculate Cronbach's alpha (statistical measure)
- Target: α > 0.70 for competency tests
Common issues:
- Heterogeneous items: Test mixes unrelated skills (prompt writing + ethical reasoning + data analysis)
- Too few items: <10 questions makes alpha unstable
Fix: Use subscales for different competency dimensions
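To make the "how to measure" step concrete, here is a minimal sketch of the standard alpha formula using only NumPy; the 0/1 item responses are illustrative.

```python
import numpy as np

def cronbach_alpha(responses: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = responses.shape[1]
    item_variances = responses.var(axis=0, ddof=1)
    total_variance = responses.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical 0/1 responses: 6 test-takers (rows) x 5 items (columns).
responses = np.array([
    [1, 1, 1, 0, 1],
    [1, 1, 0, 1, 1],
    [0, 1, 0, 0, 0],
    [1, 0, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
])
print(f"Cronbach's alpha: {cronbach_alpha(responses):.2f}")
```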
Question Design Framework
Bloom's Taxonomy for AI Assessments
Align questions with cognitive levels:
| Level | Definition | AI Example Question |
|---|---|---|
| Remember | Recall facts | "What is a hallucination in AI?" |
| Understand | Explain concepts | "Why might AI outputs contain bias?" |
| Apply | Use knowledge in new situations | "Use AI to summarize this meeting transcript" |
| Analyze | Break down information | "Compare these 3 AI-generated summaries—which is most accurate?" |
| Evaluate | Make judgments | "Should we use this AI output for this task? Why/why not?" |
| Create | Produce new work | "Design an AI workflow for monthly reporting" |
Assessment design principle:
- Literacy tests: Focus on Remember + Understand (Levels 1-2)
- Fluency tests: Focus on Apply + Analyze (Levels 3-4)
- Mastery tests: Focus on Evaluate + Create (Levels 5-6)
Writing Effective Multiple-Choice Questions
Bad example:
Q: ChatGPT was created by: A) Google B) Meta C) OpenAI ✓ D) Microsoft
Why it's bad: Tests trivia, not competency. Googleable.
Better example:
Q: You're using AI to draft a performance review. The output is factually accurate but sounds overly harsh. What should you do? A) Send it as-is—AI is objective B) Refine the prompt to request a more constructive tone ✓ C) Manually rewrite the entire review D) Abandon AI for this task
Why it's better: Tests judgment and application, not memorization.
Question Design Checklist
✅ Stems (question portion):
- Complete sentence that poses a clear problem
- No negative phrasing ("Which is NOT...") unless necessary
- Sufficient context to answer without guessing
✅ Options (answer choices):
- One clearly correct answer
- 3-4 plausible distractors (wrong but believable)
- Similar length across options
- No "all of the above" or "none of the above"
✅ Distractor quality:
- Represent common misconceptions
- Not obviously wrong
- Grammatically parallel
Performance Task Design
Authenticity Criteria
Performance tasks should mirror real work. Use the RACE framework:
R - Realistic: Matches actual job tasks
A - Ambiguous: Requires judgment (no single "right" answer)
C - Constrained: Time limit + resource limit (simulates work pressure)
E - Evaluable: Clear scoring criteria
Sample Performance Task: Email Response
Scenario: You received this customer complaint:
"I ordered your product 2 weeks ago and it still hasn't arrived. Your tracking system says 'in transit' but hasn't updated in 5 days. This is unacceptable. I need this for an event on Friday. Either get it here by Thursday or issue a full refund immediately."
Task: Use AI (ChatGPT, Claude, etc.) to draft a response email that:
- Acknowledges the customer's frustration
- Explains the situation (you'll provide: "Shipment delayed due to weather, now arriving Monday")
- Offers a solution (you can offer: overnight shipping for next order, 15% refund, or full refund + cancellation)
- Maintains professional, empathetic tone
Time limit: 10 minutes
Deliverables:
- Your prompt(s) to the AI
- The AI's output
- Your final edited email (ready to send)
Scoring rubric: (See next section)
Scoring Rubric Design
Use Behaviorally Anchored Rating Scales (BARS) to reduce subjectivity:
Dimension 1: Prompt Quality (0-5)
| Score | Descriptor | Behavioral Anchor |
|---|---|---|
| 5 | Expert | Prompt includes: customer issue, company policy context, tone requirement, specific facts. AI output needs zero editing. |
| 4 | Proficient | Prompt includes most context. AI output needs minor edits (1-2 sentences). |
| 3 | Developing | Prompt missing key context (e.g., tone requirement). AI output needs moderate editing (3-5 sentences). |
| 2 | Struggling | Prompt vague. AI output requires major rewrite (>50% of content). |
| 1 | Insufficient | Prompt minimal ("Write an apology email"). AI output unusable. |
Dimension 2: Output Evaluation (0-5)
| Score | Descriptor | Behavioral Anchor |
|---|---|---|
| 5 | Expert | Correctly identified AI output issues (if any) and made appropriate edits. Final email is professional, accurate, empathetic. |
| 4 | Proficient | Made most necessary edits. Final email is good with 1-2 minor issues. |
| 3 | Developing | Missed some AI errors or made unnecessary edits. Final email is acceptable but not polished. |
| 2 | Struggling | Didn't catch significant AI errors. Final email has factual errors or tone problems. |
| 1 | Insufficient | Sent AI output with minimal review. Final email unprofessional or inaccurate. |
Dimension 3: Efficiency (0-5)
| Score | Descriptor | Behavioral Anchor |
|---|---|---|
| 5 | Expert | Completed in <7 minutes with 1-2 prompt iterations. Efficient workflow. |
| 4 | Proficient | Completed in 7-9 minutes with 2-3 iterations. Reasonable efficiency. |
| 3 | Developing | Completed in 9-10 minutes with 4-5 iterations. Some wasted effort. |
| 2 | Struggling | Barely finished within the 10-minute limit or ran over time. >5 iterations. Inefficient. |
| 1 | Insufficient | Did not complete task in time limit. No viable output. |
Total score: Sum of 3 dimensions (max 15 points)
Pass threshold: ≥11 points (73%)
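A minimal sketch of the scoring arithmetic for this rubric; the class and field names are illustrative, not a required implementation.

```python
from dataclasses import dataclass

PASS_THRESHOLD = 11  # out of 15 points (~73%)

@dataclass
class PerformanceTaskScore:
    prompt_quality: int      # 0-5, Dimension 1
    output_evaluation: int   # 0-5, Dimension 2
    efficiency: int          # 0-5, Dimension 3

    @property
    def total(self) -> int:
        return self.prompt_quality + self.output_evaluation + self.efficiency

    @property
    def passed(self) -> bool:
        return self.total >= PASS_THRESHOLD

score = PerformanceTaskScore(prompt_quality=4, output_evaluation=5, efficiency=3)
print(f"Total: {score.total}/15 -> {'pass' if score.passed else 'fail'}")
```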
Psychometric Validation Process
Before deploying an AI competency test organization-wide, validate it:
Step 1: Pilot Test (n=30-50)
Objectives:
- Test clarity: Do people understand questions?
- Time adequacy: Can they finish in allotted time?
- Difficulty distribution: Not too easy/hard
- Technical issues: Platform glitches
Data to collect:
- Completion rate
- Time spent per question
- Item difficulty (% getting each question correct)
- Open feedback ("What was confusing?")
Red flags:
- >10% don't finish (too long)
- Any question with <20% or >95% correct (too hard/easy)
- Consistent complaints about unclear instructions
Step 2: Item Analysis
For each question, calculate:
Difficulty (p-value):
- Formula: (# who answered correctly) / (# who attempted)
- Target range: 0.30 - 0.90
- Too easy (>0.90): Doesn't differentiate skill levels
- Too hard (<0.30): May be poorly written or off-topic
Discrimination (point-biserial correlation):
- Measures whether high performers on overall test also answer this question correctly
- Target: r > 0.20
- Low discrimination (<0.15): Question might be flawed or irrelevant
Decision rules:
- Keep items with p between 0.30-0.90 AND discrimination >0.20
- Revise items outside these ranges
- Delete items that can't be fixed
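Here is a minimal sketch of the Step 2 calculations, assuming SciPy is available and items are scored 0/1; the simulated responses are illustrative only.

```python
import numpy as np
from scipy.stats import pointbiserialr

# Simulate a hypothetical pilot: 40 test-takers, 12 items scored 0/1.
rng = np.random.default_rng(0)
n_people, n_items = 40, 12
ability = rng.normal(size=(n_people, 1))
item_location = rng.normal(size=(1, n_items))
p_correct = 1 / (1 + np.exp(-(ability - item_location)))
responses = (rng.random((n_people, n_items)) < p_correct).astype(int)

total_scores = responses.sum(axis=1)
for i in range(n_items):
    item = responses[:, i]
    difficulty = item.mean()                              # p-value: proportion answering correctly
    if item.min() == item.max():                          # no variance: everyone right or wrong
        print(f"Item {i + 1:2d}: p = {difficulty:.2f} -> review (no variance)")
        continue
    rest_score = total_scores - item                      # total minus the item itself
    discrimination, _ = pointbiserialr(item, rest_score)  # corrected item-total correlation
    verdict = "keep" if 0.30 <= difficulty <= 0.90 and discrimination > 0.20 else "review"
    print(f"Item {i + 1:2d}: p = {difficulty:.2f}, r_pb = {discrimination:.2f} -> {verdict}")
```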
Step 3: Reliability Analysis
Internal consistency (Cronbach's alpha):
- Calculate for overall test
- Target: α > 0.70
- If too low: Remove poor items or add more items
Inter-rater reliability (for performance tasks):
- Have 2 raters score 20 submissions independently
- Calculate percent exact agreement and Cohen's kappa
- Target: >85% agreement, kappa >0.75
- If too low: Revise scoring rubric for clarity
Step 4: Validity Study
Criterion validity:
- Correlate test scores with external measure of AI proficiency:
  - Manager ratings
  - Peer nominations ("Who uses AI most effectively?")
  - Objective usage data (# of AI sessions, time savings)
- Target correlation: r > 0.40 (moderate), ideally >0.50 (strong)
Example validation:
- Test 100 employees
- Ask managers: "Rate this person's AI proficiency on 1-5 scale"
- Calculate correlation between test scores and manager ratings
- If r = 0.55, test has good criterion validity
Step 5: Bias Analysis
Differential item functioning (DIF):
Check if questions systematically favor certain groups
Process:
- Compare performance by demographic group (age, gender, tenure, etc.)
- Identify items where groups differ significantly after controlling for overall ability
- Revise or remove biased items
Example:
- Item: "Use AI to optimize your TikTok marketing strategy"
- Older workers score lower even when they have same overall AI competency
- Diagnosis: Biased—assumes familiarity with TikTok
- Fix: Use neutral platform ("social media platform of your choice")
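One common way to run this check is logistic regression DIF. The sketch below assumes statsmodels and pandas are available; the column names and simulated data are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical per-person records for a single item.
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "item_correct": rng.integers(0, 2, n),   # 1 if this item was answered correctly
    "total_score": rng.normal(70, 10, n),    # overall ability proxy
    "group": rng.integers(0, 2, n),          # e.g., 0 = under 40, 1 = 40 and over
})

# Predict item success from overall ability plus group membership; a significant
# group coefficient after controlling for ability flags potential DIF.
X = sm.add_constant(data[["total_score", "group"]])
fit = sm.Logit(data["item_correct"], X).fit(disp=False)
p = fit.pvalues["group"]
print(f"Group effect p-value: {p:.3f} -> {'flag item for review' if p < 0.05 else 'no DIF signal'}")
```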
Standard Setting: Defining "Passing"
How do you determine the cut score (minimum passing score)?
Method 1: Angoff Standard Setting
Process:
- Assemble panel of 5-8 subject matter experts (SMEs)
- Define "minimally competent" AI user
- For each question, SMEs estimate: "What % of minimally competent users would answer this correctly?"
- Average estimates across SMEs and questions
- Result = cut score
Example:
- 20 questions on test
- SMEs estimate minimally competent user would get: 60%, 70%, 80%, 65%, 75%... (for each question)
- Average across questions: 72%
- Cut score: 72%
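A minimal sketch of the Angoff averaging step, assuming each SME has recorded a per-item estimate as a proportion; the values are illustrative.

```python
import numpy as np

# Hypothetical Angoff estimates: rows = SMEs, columns = items
# (probability that a minimally competent user answers the item correctly).
angoff_estimates = np.array([
    [0.60, 0.70, 0.80, 0.65, 0.75],
    [0.55, 0.75, 0.85, 0.60, 0.70],
    [0.65, 0.70, 0.75, 0.70, 0.80],
])

cut_score = angoff_estimates.mean()  # average across SMEs and items
print(f"Recommended cut score: {cut_score:.0%}")
```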
Method 2: Contrasting Groups
Process:
- Identify two groups:
  - Competent: Known to use AI effectively (manager/peer confirmed)
  - Not competent: Known to struggle with AI
- Administer test to both groups
- Find score that best separates groups (maximizes hits, minimizes false positives/negatives)
Example:
- Competent group: Mean score = 82%, SD = 8
- Not competent group: Mean score = 58%, SD = 12
- Optimal cut score (minimizes misclassification): 70%
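A minimal sketch of the contrasting-groups search, assuming you already have verified "competent" and "not competent" score lists; the simulated scores are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
competent = rng.normal(82, 8, 40)        # hypothetical scores, confirmed proficient users
not_competent = rng.normal(58, 12, 40)   # hypothetical scores, known to struggle

candidates = np.arange(50, 91)
errors = [
    (competent < cut).sum() + (not_competent >= cut).sum()  # false negatives + false positives
    for cut in candidates
]
best_cut = candidates[int(np.argmin(errors))]
print(f"Optimal cut score: {best_cut}")
```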
Method 3: Normative (Percentile-Based)
Process:
- Administer test to representative sample
- Set cut score at desired percentile (e.g., 70th percentile)
When to use: When you need to credential top performers (e.g., "AI Champions must score in top 20%")
When NOT to use: When you need to ensure minimum competency for safety/compliance
Legal & Compliance Considerations
EEOC Guidelines (US)
If AI competency test is used for hiring, promotion, or other employment decisions:
Requirements:
- Job relatedness: Test must measure skills required for job
- Business necessity: Must prove test predicts job performance
- Adverse impact analysis: Check if test disproportionately screens out protected groups
- Validation evidence: Maintain documentation of validity studies
Red flag: The test passes only 20% of one demographic group but 50% of another, a selection-rate ratio of 0.40 that violates the "four-fifths rule" (ratio below 0.80)
Fix: Conduct DIF analysis, remove biased items, or prove test predicts job performance for all groups
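A minimal sketch of an adverse impact (four-fifths) check, assuming pandas is available and pass/fail outcomes are tagged by demographic group; the group labels and counts are illustrative.

```python
import pandas as pd

# Hypothetical outcomes: 50 test-takers in each of two groups.
results = pd.DataFrame({
    "group": ["A"] * 50 + ["B"] * 50,
    "passed": [True] * 40 + [False] * 10 + [True] * 25 + [False] * 25,
})

pass_rates = results.groupby("group")["passed"].mean()
impact_ratio = pass_rates.min() / pass_rates.max()
print(pass_rates)
print(f"Impact ratio: {impact_ratio:.2f} -> "
      f"{'potential four-fifths violation' if impact_ratio < 0.80 else 'within guideline'}")
```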
GDPR Considerations (EU)
If assessing EU-based employees:
Requirements:
- Data minimization: Only collect scores needed for decision
- Transparency: Inform test-takers how scores will be used
- Right to explanation: Employees can request explanation of scoring
- Automated decision-making: If test auto-fails candidates, human review required
Key Takeaways
- Validity is paramount: A test that doesn't measure real AI competency is worse than no test—it creates false confidence.
- Performance tasks are essential for fluency/mastery assessment—knowledge questions alone can't predict applied skill.
- Behaviorally anchored rubrics reduce scorer bias and improve inter-rater reliability.
- Pilot and validate before scaling: Item analysis, reliability checks, and validity studies prevent costly mistakes.
- Standard setting should be evidence-based: Use Angoff or contrasting groups methods, not arbitrary percentages.
- Legal compliance requires documentation: Maintain validity studies and bias analyses if using assessments for employment decisions.
Next Steps
This week:
- Define the AI competency construct for your organization (what skills matter for each role?)
- Draft 5 multiple-choice questions using scenario-based format
- Design 1 performance task with BARS rubric
This month:
- Pilot test with 30-50 employees
- Conduct item analysis (difficulty, discrimination)
- Calculate Cronbach's alpha and inter-rater reliability
This quarter:
- Conduct criterion validity study (correlate scores with manager ratings or usage data)
- Perform bias analysis (DIF by demographic group)
- Use Angoff method to set defensible cut score
Partner with Pertama Partners to design, validate, and defend AI competency assessments that meet psychometric and legal standards while accurately measuring real capability.
Frequently Asked Questions
What's the difference between AI literacy and AI competency assessment?
AI literacy focuses on basic understanding of concepts and terminology, while AI competency measures the ability to apply AI tools effectively in real work. Literacy can be assessed with knowledge questions; competency requires scenario-based items and performance tasks aligned to job tasks.
How many questions does an AI competency test need?
For most workplace competency tests, 25–40 well-designed items plus 1–3 performance tasks are enough to achieve acceptable reliability (Cronbach's alpha > 0.70). Fewer items can work if they are tightly focused and supported by robust scoring rubrics.
Do we need a psychometrician to build the assessment?
You don't strictly need a psychometrician for small internal pilots, but for high-stakes uses (hiring, promotion, certification) you should involve someone with psychometric expertise to run item analysis, reliability, validity, and bias checks and to document the evidence.
How often should we review and update the test?
Review at least annually, or sooner if there are major changes in AI tools or workflows. Use item performance data, SME review, and feedback from test-takers to retire outdated items, add new ones, and revalidate the assessment.
Can we use a generic off-the-shelf AI quiz for certification?
Generic quizzes rarely align with your job tasks and usually lack validation evidence. They can create false positives and legal risk if tied to employment decisions. For certification, design role-specific assessments and validate them against real performance.
Don’t Confuse Trivia with Competency
If your AI test can be passed by memorizing definitions or searching the web, it will not predict on-the-job performance. Prioritize scenario-based questions and performance tasks that mirror real decisions and workflows.
Start Small, Then Harden the Assessment
Begin with a pilot focused on one role or business unit. Use pilot data to refine items, rubrics, and cut scores before scaling the assessment across the organization.
Use Subscales to Target Development
Break your AI competency model into subscales such as prompt engineering, critical evaluation of outputs, workflow design, and ethics. Reporting scores by subscale gives managers clearer development guidance than a single overall score.
α > 0.70: minimum recommended Cronbach's alpha for AI competency tests (source: psychometric best-practice guidelines)
r > 0.50: target correlation between test scores and manager ratings for strong predictive validity (source: applied industrial-organizational psychology practice)
"A smaller, well-validated AI assessment is more defensible and predictive than a long, unvalidated test full of trivia."
— Pertama Partners – AI Assessment Practice
"Performance tasks plus behaviorally anchored rubrics are the single most powerful lever for making AI competency assessments both fair and predictive."
— Pertama Partners – AI Assessment Practice
References
- Standards for Educational and Psychological Testing. American Educational Research Association, American Psychological Association, National Council on Measurement in Education (2014)
- Uniform Guidelines on Employee Selection Procedures. U.S. Equal Employment Opportunity Commission (1978)
- General Data Protection Regulation (GDPR). European Union (2016)
