Executive Summary
Poorly designed AI competency tests create false confidence: employees pass but can't perform. This guide provides evidence-based principles for designing AI assessments with high validity (measures what it claims to measure) and reliability (consistent results across administrations). Learn how to craft knowledge questions, performance tasks, and scoring systems that accurately predict real-world AI capability.
What you'll learn:
- Test validity principles: ensuring assessments measure actual AI competency
- Reliability techniques: reducing scorer bias and test-retest variation
- Question design frameworks for knowledge, application, and synthesis
- Performance task construction for authentic AI skill measurement
- Psychometric validation methods to prove assessment quality
Expected outcome: AI competency tests that reliably identify who can use AI effectively, predict job performance, and withstand scrutiny from leadership, legal, and external auditors.
The Cost of Invalid Assessments
What happens when AI competency tests are poorly designed:
Scenario 1: False Positives
- Employee passes AI competency test (90% score)
- Manager assigns them to draft client proposals using AI
- Outputs are unusable—require complete rewrite
- Client relationship damaged by poor quality
Root cause: Test measured trivia ("What is a token?") not capability ("Draft a proposal using effective prompts").
Scenario 2: False Negatives
- Experienced employee fails AI test (65% score)
- Excluded from AI pilot program
- They were already using AI effectively in their role
- Organization loses a potential AI champion
Root cause: Test used obscure technical jargon not relevant to job tasks.
Scenario 3: Legal Liability
- AI competency test used for promotion decisions
- Disproportionately fails older workers
- Discrimination lawsuit filed
- No evidence test predicts job performance
Root cause: No validation study proving test relevance to role requirements.
The fix: Design assessments using validated psychometric principles.
Test Validity: Does It Measure What It Claims?
Validity is the most important quality of any assessment. An AI competency test is valid if scores correlate with actual AI performance on the job.
Types of Validity
1. Content Validity
Definition: Test content represents the domain of AI skills required for the role.
How to establish:
- Map test items to job task analysis
- Subject matter expert (SME) review panel confirms relevance
- Coverage matrix ensures all critical skills are assessed
Example: For a marketing role using AI:
- ✅ "Use AI to draft 3 social media posts from this blog article" (job-relevant)
- ❌ "Explain the architecture of a transformer model" (not job-relevant)
2. Construct Validity
Definition: Test measures the theoretical construct of "AI competency" as defined.
How to establish:
- Factor analysis shows items cluster into expected dimensions (e.g., prompt engineering, output evaluation, ethical judgment)
- Scores correlate with related constructs (tech fluency, problem-solving ability)
- Scores don't correlate with unrelated constructs (age, tenure)
3. Criterion Validity
Definition: Test scores predict external outcomes (job performance, productivity, manager ratings).
How to establish:
- Concurrent validity: Employees who score high also demonstrate high AI proficiency in current work
- Predictive validity: New hires who score high become proficient AI users faster
Validation study example:
- Administer AI competency test to 100 employees
- Collect manager ratings of AI proficiency 3 months later
- Calculate correlation (target: r > 0.50)
- If correlation is strong, test has predictive validity
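To make the correlation step concrete, here is a minimal sketch in Python (assuming NumPy and SciPy are available); the scores and ratings below are purely illustrative, not real study data.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical matched data: one test score and one later manager rating per employee.
test_scores = np.array([78, 92, 65, 88, 71, 95, 60, 84, 79, 90])
manager_ratings = np.array([3.5, 4.5, 2.5, 4.0, 3.0, 5.0, 2.0, 4.0, 3.5, 4.5])

r, p_value = pearsonr(test_scores, manager_ratings)
print(f"Criterion validity: r = {r:.2f} (p = {p_value:.3f})")
print("Meets r > 0.50 target" if r > 0.50 else "Below r > 0.50 target")
```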
Test Reliability: Consistency Across Administrations
Reliability means the test produces consistent results. An unreliable test is useless—scores fluctuate randomly.
Types of Reliability
1. Test-Retest Reliability
Definition: Same person gets similar scores when taking the test twice (with time gap).
How to measure:
- Give test to 30 people
- Re-administer 2 weeks later
- Calculate correlation between Time 1 and Time 2 scores
- Target: r > 0.80
Common issues:
- Too easy: Everyone scores 90%+ on both attempts (ceiling effect)
- Memory effect: People remember questions from first attempt
- Learning effect: People improved their skills between tests
Fix: Use parallel forms (two versions of same test with different questions)
2. Inter-Rater Reliability
Definition: Different scorers assign similar scores to the same performance.
How to measure:
- Have 2 raters independently score the same 20 submissions
- Calculate agreement percentage or Cohen's kappa
- Target: >85% exact agreement or kappa > 0.75
Common issues:
- Vague rubrics: "Good prompt quality" (subjective)
- Halo effect: Rater's overall impression influences all scores
- Leniency/severity: Some raters consistently score higher/lower
Fix: Use behaviorally anchored rating scales (BARS) with specific examples
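A minimal sketch of the agreement and kappa calculation described above, assuming scikit-learn is available and two raters have scored the same submissions on the shared 0-5 scale; the scores are illustrative.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical scores from two raters for the same 20 submissions.
rater_a = np.array([5, 4, 3, 4, 2, 5, 3, 4, 4, 1, 5, 3, 2, 4, 5, 3, 4, 2, 3, 4])
rater_b = np.array([5, 4, 3, 3, 2, 5, 3, 4, 4, 1, 5, 3, 2, 4, 4, 3, 4, 2, 3, 4])

exact_agreement = np.mean(rater_a == rater_b)   # proportion of identical scores
kappa = cohen_kappa_score(rater_a, rater_b)     # agreement corrected for chance
print(f"Exact agreement: {exact_agreement:.0%}, Cohen's kappa: {kappa:.2f}")
```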
3. Internal Consistency
Definition: Test items measure the same underlying construct.
How to measure:
- Calculate Cronbach's alpha (statistical measure)
- Target: α > 0.70 for competency tests
Common issues:
- Heterogeneous items: Test mixes unrelated skills (prompt writing + ethical reasoning + data analysis)
- Too few items: <10 questions makes alpha unstable
Fix: Use subscales for different competency dimensions
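To make the "how to measure" step concrete, here is a minimal sketch of the standard alpha formula using only NumPy; the 0/1 item responses are illustrative.

```python
import numpy as np

def cronbach_alpha(responses: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = responses.shape[1]
    item_variances = responses.var(axis=0, ddof=1)
    total_variance = responses.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical 0/1 responses: 6 test-takers (rows) x 5 items (columns).
responses = np.array([
    [1, 1, 1, 0, 1],
    [1, 1, 0, 1, 1],
    [0, 1, 0, 0, 0],
    [1, 0, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
])
print(f"Cronbach's alpha: {cronbach_alpha(responses):.2f}")
```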
Question Design Framework
Bloom's Taxonomy for AI Assessments
Align questions with cognitive levels:
| Level | Definition | AI Example Question |
|---|---|---|
| Remember | Recall facts | "What is a hallucination in AI?" |
| Understand | Explain concepts | "Why might AI outputs contain bias?" |
| Apply | Use knowledge in new situations | "Use AI to summarize this meeting transcript" |
| Analyze | Break down information | "Compare these 3 AI-generated summaries—which is most accurate?" |
| Evaluate | Make judgments | "Should we use this AI output for this task? Why/why not?" |
| Create | Produce new work | "Design an AI workflow for monthly reporting" |
Assessment design principle:
- Literacy tests: Focus on Remember + Understand (Levels 1-2)
- Fluency tests: Focus on Apply + Analyze (Levels 3-4)
- Mastery tests: Focus on Evaluate + Create (Levels 5-6)
Writing Effective Multiple-Choice Questions
Bad example:
Q: ChatGPT was created by: A) Google B) Meta C) OpenAI ✓ D) Microsoft
Why it's bad: Tests trivia, not competency. Googleable.
Better example:
Q: You're using AI to draft a performance review. The output is factually accurate but sounds overly harsh. What should you do? A) Send it as-is—AI is objective B) Refine the prompt to request a more constructive tone ✓ C) Manually rewrite the entire review D) Abandon AI for this task
Why it's better: Tests judgment and application, not memorization.
Question Design Checklist
✅ Stems (question portion):
- Complete sentence that poses a clear problem
- No negative phrasing ("Which is NOT...") unless necessary
- Sufficient context to answer without guessing
✅ Options (answer choices):
- One clearly correct answer
- 3-4 plausible distractors (wrong but believable)
- Similar length across options
- No "all of the above" or "none of the above"
✅ Distractor quality:
- Represent common misconceptions
- Not obviously wrong
- Grammatically parallel
Performance Task Design
Authenticity Criteria
Performance tasks should mirror real work. Use the RACE framework:
R - Realistic: Matches actual job tasks
A - Ambiguous: Requires judgment (no single "right" answer)
C - Constrained: Time limit + resource limit (simulates work pressure)
E - Evaluable: Clear scoring criteria
Sample Performance Task: Email Response
Scenario: You received this customer complaint:
"I ordered your product 2 weeks ago and it still hasn't arrived. Your tracking system says 'in transit' but hasn't updated in 5 days. This is unacceptable. I need this for an event on Friday. Either get it here by Thursday or issue a full refund immediately."
Task: Use AI (ChatGPT, Claude, etc.) to draft a response email that:
- Acknowledges the customer's frustration
- Explains the situation (you'll provide: "Shipment delayed due to weather, now arriving Monday")
- Offers a solution (you can offer: overnight shipping for next order, 15% refund, or full refund + cancellation)
- Maintains professional, empathetic tone
Time limit: 10 minutes
Deliverables:
- Your prompt(s) to the AI
- The AI's output
- Your final edited email (ready to send)
Scoring rubric: (See next section)
Scoring Rubric Design
Use Behaviorally Anchored Rating Scales (BARS) to reduce subjectivity:
Dimension 1: Prompt Quality (0-5)
| Score | Descriptor | Behavioral Anchor |
|---|---|---|
| 5 | Expert | Prompt includes: customer issue, company policy context, tone requirement, specific facts. AI output needs zero editing. |
| 4 | Proficient | Prompt includes most context. AI output needs minor edits (1-2 sentences). |
| 3 | Developing | Prompt missing key context (e.g., tone requirement). AI output needs moderate editing (3-5 sentences). |
| 2 | Struggling | Prompt vague. AI output requires major rewrite (>50% of content). |
| 1 | Insufficient | Prompt minimal ("Write an apology email"). AI output unusable. |
Dimension 2: Output Evaluation (0-5)
| Score | Descriptor | Behavioral Anchor |
|---|---|---|
| 5 | Expert | Correctly identified AI output issues (if any) and made appropriate edits. Final email is professional, accurate, empathetic. |
| 4 | Proficient | Made most necessary edits. Final email is good with 1-2 minor issues. |
| 3 | Developing | Missed some AI errors or made unnecessary edits. Final email is acceptable but not polished. |
| 2 | Struggling | Didn't catch significant AI errors. Final email has factual errors or tone problems. |
| 1 | Insufficient | Sent AI output with minimal review. Final email unprofessional or inaccurate. |
Dimension 3: Efficiency (0-5)
| Score | Descriptor | Behavioral Anchor |
|---|---|---|
| 5 | Expert | Completed in <7 minutes with 1-2 prompt iterations. Efficient workflow. |
| 4 | Proficient | Completed in 7-9 minutes with 2-3 iterations. Reasonable efficiency. |
| 3 | Developing | Completed in 9-10 minutes with 4-5 iterations. Some wasted effort. |
| 2 | Struggling | Barely finished within the 10-minute limit or ran over time. >5 iterations. Inefficient. |
| 1 | Insufficient | Did not complete task in time limit. No viable output. |
Total score: Sum of 3 dimensions (max 15 points)
Pass threshold: ≥11 points (73%)
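A minimal sketch of the scoring arithmetic for this rubric; the class and field names are illustrative, not a required implementation.

```python
from dataclasses import dataclass

PASS_THRESHOLD = 11  # out of 15 points (~73%)

@dataclass
class PerformanceTaskScore:
    prompt_quality: int      # 0-5, Dimension 1
    output_evaluation: int   # 0-5, Dimension 2
    efficiency: int          # 0-5, Dimension 3

    @property
    def total(self) -> int:
        return self.prompt_quality + self.output_evaluation + self.efficiency

    @property
    def passed(self) -> bool:
        return self.total >= PASS_THRESHOLD

score = PerformanceTaskScore(prompt_quality=4, output_evaluation=5, efficiency=3)
print(f"Total: {score.total}/15 -> {'pass' if score.passed else 'fail'}")
```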
Psychometric Validation Process
Before deploying an AI competency test organization-wide, validate it:
Step 1: Pilot Test (n=30-50)
Objectives:
- Test clarity: Do people understand questions?
- Time adequacy: Can they finish in allotted time?
- Difficulty distribution: Not too easy/hard
- Technical issues: Platform glitches
Data to collect:
- Completion rate
- Time spent per question
- Item difficulty (% getting each question correct)
- Open feedback ("What was confusing?")
Red flags:
- >10% don't finish (too long)
- Any question with <20% or >95% correct (too hard/easy)
- Consistent complaints about unclear instructions
Step 2: Item Analysis
For each question, calculate:
Difficulty (p-value):
- Formula: (# who answered correctly) / (# who attempted)
- Target range: 0.30 - 0.90
- Too easy (>0.90): Doesn't differentiate skill levels
- Too hard (<0.30): May be poorly written or off-topic
Discrimination (point-biserial correlation):
- Measures whether high performers on overall test also answer this question correctly
- Target: r > 0.20
- Low discrimination (<0.15): Question might be flawed or irrelevant
Decision rules:
- Keep items with p between 0.30-0.90 AND discrimination >0.20
- Revise items outside these ranges
- Delete items that can't be fixed
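Here is a minimal sketch of the Step 2 calculations, assuming SciPy is available and items are scored 0/1; the simulated responses are illustrative only.

```python
import numpy as np
from scipy.stats import pointbiserialr

# Simulate a hypothetical pilot: 40 test-takers, 12 items scored 0/1.
rng = np.random.default_rng(0)
n_people, n_items = 40, 12
ability = rng.normal(size=(n_people, 1))
item_location = rng.normal(size=(1, n_items))
p_correct = 1 / (1 + np.exp(-(ability - item_location)))
responses = (rng.random((n_people, n_items)) < p_correct).astype(int)

total_scores = responses.sum(axis=1)
for i in range(n_items):
    item = responses[:, i]
    difficulty = item.mean()                              # p-value: proportion answering correctly
    if item.min() == item.max():                          # no variance: everyone right or wrong
        print(f"Item {i + 1:2d}: p = {difficulty:.2f} -> review (no variance)")
        continue
    rest_score = total_scores - item                      # total minus the item itself
    discrimination, _ = pointbiserialr(item, rest_score)  # corrected item-total correlation
    verdict = "keep" if 0.30 <= difficulty <= 0.90 and discrimination > 0.20 else "review"
    print(f"Item {i + 1:2d}: p = {difficulty:.2f}, r_pb = {discrimination:.2f} -> {verdict}")
```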
Step 3: Reliability Analysis
Internal consistency (Cronbach's alpha):
- Calculate for overall test
- Target: α > 0.70
- If too low: Remove poor items or add more items
Inter-rater reliability (for performance tasks):
- Have 2 raters score 20 submissions independently
- Calculate percent exact agreement and Cohen's kappa
- Target: >85% agreement, kappa >0.75
- If too low: Revise scoring rubric for clarity
Step 4: Validity Study
Criterion validity:
- Correlate test scores with external measure of AI proficiency:
  - Manager ratings
  - Peer nominations ("Who uses AI most effectively?")
  - Objective usage data (# of AI sessions, time savings)
- Target correlation: r > 0.40 (moderate), ideally >0.50 (strong)
Example validation:
- Test 100 employees
- Ask managers: "Rate this person's AI proficiency on 1-5 scale"
- Calculate correlation between test scores and manager ratings
- If r = 0.55, test has good criterion validity
Step 5: Bias Analysis
Differential item functioning (DIF):
Check if questions systematically favor certain groups
Process:
- Compare performance by demographic group (age, gender, tenure, etc.)
- Identify items where groups differ significantly after controlling for overall ability
- Revise or remove biased items
Example:
- Item: "Use AI to optimize your TikTok marketing strategy"
- Older workers score lower even when they have same overall AI competency
- Diagnosis: Biased—assumes familiarity with TikTok
- Fix: Use neutral platform ("social media platform of your choice")
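One common way to run this check is logistic regression DIF. The sketch below assumes statsmodels and pandas are available; the column names and simulated data are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical per-person records for a single item.
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "item_correct": rng.integers(0, 2, n),   # 1 if this item was answered correctly
    "total_score": rng.normal(70, 10, n),    # overall ability proxy
    "group": rng.integers(0, 2, n),          # e.g., 0 = under 40, 1 = 40 and over
})

# Predict item success from overall ability plus group membership; a significant
# group coefficient after controlling for ability flags potential DIF.
X = sm.add_constant(data[["total_score", "group"]])
fit = sm.Logit(data["item_correct"], X).fit(disp=False)
p = fit.pvalues["group"]
print(f"Group effect p-value: {p:.3f} -> {'flag item for review' if p < 0.05 else 'no DIF signal'}")
```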
Standard Setting: Defining "Passing"
How do you determine the cut score (minimum passing score)?
Method 1: Angoff Standard Setting
Process:
- Assemble panel of 5-8 subject matter experts (SMEs)
- Define "minimally competent" AI user
- For each question, SMEs estimate: "What % of minimally competent users would answer this correctly?"
- Average estimates across SMEs and questions
- Result = cut score
Example:
- 20 questions on test
- SMEs estimate minimally competent user would get: 60%, 70%, 80%, 65%, 75%... (for each question)
- Average across questions: 72%
- Cut score: 72%
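A minimal sketch of the Angoff averaging step, assuming each SME has recorded a per-item estimate as a proportion; the values are illustrative.

```python
import numpy as np

# Hypothetical Angoff estimates: rows = SMEs, columns = items
# (probability that a minimally competent user answers the item correctly).
angoff_estimates = np.array([
    [0.60, 0.70, 0.80, 0.65, 0.75],
    [0.55, 0.75, 0.85, 0.60, 0.70],
    [0.65, 0.70, 0.75, 0.70, 0.80],
])

cut_score = angoff_estimates.mean()  # average across SMEs and items
print(f"Recommended cut score: {cut_score:.0%}")
```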
Method 2: Contrasting Groups
Process:
- Identify two groups:
  - Competent: Known to use AI effectively (manager/peer confirmed)
  - Not competent: Known to struggle with AI
- Administer test to both groups
- Find score that best separates groups (maximizes hits, minimizes false positives/negatives)
Example:
- Competent group: Mean score = 82%, SD = 8
- Not competent group: Mean score = 58%, SD = 12
- Optimal cut score (minimizes misclassification): 70%
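A minimal sketch of the contrasting-groups search, assuming you already have verified "competent" and "not competent" score lists; the simulated scores are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
competent = rng.normal(82, 8, 40)        # hypothetical scores, confirmed proficient users
not_competent = rng.normal(58, 12, 40)   # hypothetical scores, known to struggle

candidates = np.arange(50, 91)
errors = [
    (competent < cut).sum() + (not_competent >= cut).sum()  # false negatives + false positives
    for cut in candidates
]
best_cut = candidates[int(np.argmin(errors))]
print(f"Optimal cut score: {best_cut}")
```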
Method 3: Normative (Percentile-Based)
Process:
- Administer test to representative sample
- Set cut score at desired percentile (e.g., 70th percentile)
When to use: When you need to credential top performers (e.g., "AI Champions must score in top 20%")
When NOT to use: When you need to ensure minimum competency for safety/compliance
Legal & Compliance Considerations
EEOC Guidelines (US)
If AI competency test is used for hiring, promotion, or other employment decisions:
Requirements:
- Job relatedness: Test must measure skills required for job
- Business necessity: Must prove test predicts job performance
- Adverse impact analysis: Check if test disproportionately screens out protected groups
- Validation evidence: Maintain documentation of validity studies
Red flag: The test passes only 20% of one demographic group but 50% of another, a selection-rate ratio of 0.40 that violates the "four-fifths rule" (ratio below 0.80)
Fix: Conduct DIF analysis, remove biased items, or prove test predicts job performance for all groups
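A minimal sketch of an adverse impact (four-fifths) check, assuming pandas is available and pass/fail outcomes are tagged by demographic group; the group labels and counts are illustrative.

```python
import pandas as pd

# Hypothetical outcomes: 50 test-takers in each of two groups.
results = pd.DataFrame({
    "group": ["A"] * 50 + ["B"] * 50,
    "passed": [True] * 40 + [False] * 10 + [True] * 25 + [False] * 25,
})

pass_rates = results.groupby("group")["passed"].mean()
impact_ratio = pass_rates.min() / pass_rates.max()
print(pass_rates)
print(f"Impact ratio: {impact_ratio:.2f} -> "
      f"{'potential four-fifths violation' if impact_ratio < 0.80 else 'within guideline'}")
```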
GDPR Considerations (EU)
If assessing EU-based employees:
Requirements:
- Data minimization: Only collect scores needed for decision
- Transparency: Inform test-takers how scores will be used
- Right to explanation: Employees can request explanation of scoring
- Automated decision-making: If test auto-fails candidates, human review required
Key Takeaways
- Validity is paramount: A test that doesn't measure real AI competency is worse than no test—it creates false confidence.
- Performance tasks are essential for fluency/mastery assessment—knowledge questions alone can't predict applied skill.
- Behaviorally anchored rubrics reduce scorer bias and improve inter-rater reliability.
- Pilot and validate before scaling: Item analysis, reliability checks, and validity studies prevent costly mistakes.
- Standard setting should be evidence-based: Use Angoff or contrasting groups methods, not arbitrary percentages.
- Legal compliance requires documentation: Maintain validity studies and bias analyses if using assessments for employment decisions.
Next Steps
This week:
- Define the AI competency construct for your organization (what skills matter for each role?)
- Draft 5 multiple-choice questions using scenario-based format
- Design 1 performance task with BARS rubric
This month:
- Pilot test with 30-50 employees
- Conduct item analysis (difficulty, discrimination)
- Calculate Cronbach's alpha and inter-rater reliability
This quarter:
- Conduct criterion validity study (correlate scores with manager ratings or usage data)
- Perform bias analysis (DIF by demographic group)
- Use Angoff method to set defensible cut score
Partner with Pertama Partners to design, validate, and defend AI competency assessments that meet psychometric and legal standards while accurately measuring real capability.
Frequently Asked Questions
What's the difference between AI literacy and AI competency assessment?
AI literacy focuses on basic understanding of concepts and terminology, while AI competency measures the ability to apply AI tools effectively in real work. Literacy can be assessed with knowledge questions; competency requires scenario-based items and performance tasks aligned to job tasks.
How many questions does an AI competency test need?
For most workplace competency tests, 25–40 well-designed items plus 1–3 performance tasks are enough to achieve acceptable reliability (Cronbach's alpha > 0.70). Fewer items can work if they are tightly focused and supported by robust scoring rubrics.
Do we need a psychometrician to build the assessment?
You don't strictly need a psychometrician for small internal pilots, but for high-stakes uses (hiring, promotion, certification) you should involve someone with psychometric expertise to run item analysis, reliability, validity, and bias checks and to document the evidence.
How often should we review and update the test?
Review at least annually, or sooner if there are major changes in AI tools or workflows. Use item performance data, SME review, and feedback from test-takers to retire outdated items, add new ones, and revalidate the assessment.
Can we use a generic off-the-shelf AI quiz for certification?
Generic quizzes rarely align with your job tasks and usually lack validation evidence. They can create false positives and legal risk if tied to employment decisions. For certification, design role-specific assessments and validate them against real performance.
Don’t Confuse Trivia with Competency
If your AI test can be passed by memorizing definitions or searching the web, it will not predict on-the-job performance. Prioritize scenario-based questions and performance tasks that mirror real decisions and workflows.
Start Small, Then Harden the Assessment
Begin with a pilot focused on one role or business unit. Use pilot data to refine items, rubrics, and cut scores before scaling the assessment across the organization.
Use Subscales to Target Development
Break your AI competency model into subscales such as prompt engineering, critical evaluation of outputs, workflow design, and ethics. Reporting scores by subscale gives managers clearer development guidance than a single overall score.
α > 0.70: minimum recommended Cronbach's alpha for AI competency tests (source: psychometric best-practice guidelines)
r > 0.50: target correlation between test scores and manager ratings for strong predictive validity (source: applied industrial-organizational psychology practice)
"A smaller, well-validated AI assessment is more defensible and predictive than a long, unvalidated test full of trivia."
— Pertama Partners – AI Assessment Practice
"Performance tasks plus behaviorally anchored rubrics are the single most powerful lever for making AI competency assessments both fair and predictive."
— Pertama Partners – AI Assessment Practice
References
- Standards for Educational and Psychological Testing. American Educational Research Association, American Psychological Association, National Council on Measurement in Education (2014)
- Uniform Guidelines on Employee Selection Procedures. U.S. Equal Employment Opportunity Commission (1978)
- General Data Protection Regulation (GDPR). European Union (2016)
