Psychological Assessment - Creation and Administration of Psychological Tests
Psychological Test Writing and Evaluation
Evaluating a psychological test involves considering theoretical orientation, practical considerations, standardization, reliability, and validity.
Theoretical Orientation: Understanding the construct the test measures and ensuring items align with the theoretical description.
Practical Considerations: Assessing the reading level required and the test's length.
Standardization: Ensuring the tested population is similar to the standardization sample, the sample size is adequate, subgroup norms are established, and instructions allow for standardized administration.
Reliability: Checking if reliability estimates are sufficiently high (generally around 0.90 for clinical decision making and around 0.70 for research purposes; see the sketch after this list) and considering the trait's stability, the method of estimating reliability, and the implications of the test format.
Validity: Examining the criteria and procedures used to validate the test and ensuring accurate measurements for the intended context and purpose.
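A common way to check internal-consistency reliability against the thresholds above is Cronbach's alpha. A minimal Python sketch, assuming NumPy is available (the response matrix is toy data for illustration):

    import numpy as np

    def cronbach_alpha(items):
        # items: 2D array, rows = respondents, columns = test items
        items = np.asarray(items, dtype=float)
        k = items.shape[1]                          # number of items
        item_vars = items.var(axis=0, ddof=1)       # variance of each item
        total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    # Hypothetical responses: 6 respondents x 4 Likert items
    scores = [[4, 5, 4, 5], [2, 2, 3, 2], [5, 5, 4, 4],
              [3, 2, 3, 3], [4, 4, 5, 4], [1, 2, 1, 2]]
    print(round(cronbach_alpha(scores), 2))  # ~0.95 for this toy data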
Guidelines for Writing Psychological Test Items
Clearly define what you want to measure.
Be specific in item construction.
Be informed by relevant theory.
Generate a broad initial item pool.
Avoid redundancy in items.
Start with broad items and refine them.
Avoid creating items that are too lengthy.
Avoid items that are likely to be confusing or misleading.
Write for the appropriate reading level considering the age and background of the target population.
Ensure each item deals with only one concept at a time.
Avoid double-barreled questions like “I vote Labor because I support social programs.”
Incorporate a mix of positively and negatively worded questions.
Examples: “I tend to be a happy person” vs. “I often feel sad.”
Choose an appropriate response format.
Item Formats
Dichotomous
True/False or Yes/No questions.
Advantages:
Simplicity and ease of administration and marking.
Suitable for dichotomous constructs.
Disadvantages:
Doesn’t accommodate shades of grey or subtlety.
Can be passed without understanding (rote memorization).
Can lead to guessing (50% chance correct), requiring many items for reliability.
Polytomous
Multiple-Choice Items:
Respondents select one answer from several options, including one correct answer and several distractors.
Example: "Which neurotransmitter is most commonly associated with mood and anxiety regulation?"
Options: Dopamine, Serotonin, GABA, Acetylcholine
Dopamine - reward and pleasure systems
Serotonin - key regulator of mood and anxiety (the keyed correct answer)
GABA - primary inhibitory neurotransmitter; reduces anxiety and promotes relaxation
Acetylcholine - critical role in the autonomic nervous system, influencing functions such as heart rate and digestion
Advantages:
Easy and objective scoring.
A lower chance-performance level (less successful guessing) improves reliability.
Easy to analyze statistically.
Disadvantages:
Difficult to produce good distractors.
Reliability only improves with good distractors.
Ideally, produce 2-4 good distractors.
Evidence suggests 3-alternative multiple choice can be as good as 4+ alternatives.
May not capture the depth of knowledge or feelings.
Guessing Issue
Guessing chance = 1/n, where n = number of choices available.
Example: 3-option MCQ has ~33% chance level.
Example: 4-option MCQ has 25% chance level.
Correction formula for guessing in multiple-choice tests:
Score = R − W/(n − 1), where R = correct responses, W = wrong responses, n = number of choices.
Example Calculation:
80 correct out of 100 4-option multiple-choice questions.
Score = 80 − 20/(4 − 1) = 73.33
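The same correction, as a minimal Python sketch (numbers taken from the example above):

    def corrected_score(correct, wrong, n_options):
        # Score = R - W / (n - 1)
        return correct - wrong / (n_options - 1)

    # 80 correct out of 100 four-option items leaves 20 wrong
    print(round(corrected_score(80, 20, 4), 2))  # 73.33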
Likert Format
Measures degree of agreement or satisfaction, typically used to measure attitude.
Various options for the number of points on a Likert scale (e.g., 5-point vs. 7-point vs. 10-point).
Example: “Tom Cruise alone can save the world”: strongly disagree, disagree, neutral, agree, strongly agree.
Results can be subjected to factor analysis, which is often considered a plus (e.g., in intelligence assessment).
Multiple items show similar response patterns because they are associated with one or more latent variables.
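A minimal sketch of factor-analysing Likert responses, assuming scikit-learn is available (the data and the single-factor choice are made up for illustration):

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    # Hypothetical 5-point Likert responses: 8 respondents x 4 items
    responses = np.array([[5, 4, 2, 1], [4, 5, 1, 2], [2, 1, 5, 4],
                          [1, 2, 4, 5], [5, 5, 2, 2], [2, 2, 5, 5],
                          [4, 4, 1, 1], [1, 1, 4, 4]])

    fa = FactorAnalysis(n_components=1).fit(responses)  # assume one latent variable
    print(fa.components_)  # loadings: items with similar response patterns load together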
Number of Points:
A scale with more points (e.g., 7 vs. 5) allows for finer discrimination between responses, capturing nuanced variations in attitudes or opinions.
Interpretability:
More points on a Likert scale facilitate clearer interpretation of respondents' attitudes or perceptions. With a greater number of response options, it becomes easier to distinguish between different levels of agreement or disagreement, providing richer insights into respondents' opinions.
Psychometric Properties:
A 7-point scale might provide higher internal consistency and greater discriminant validity compared to a 5-point scale, especially for constructs with more variability or complexity.
Anchors on a Likert scale are critical: they are the reference points that respondents use to interpret and assign meaning to their responses.
Unipolar vs. Bipolar Scales:
Unipolar scales indicate the presence or absence of a quality or trait (0, 1, 2, 3, 4).
Bipolar scales balance two different qualities (-2, -1, 0, 1, 2).
Category Format
Respondents choose a category that best describes their response or behavior.
Ideal for measuring constructs that naturally fall into distinct categories, such as levels of physical activity, frequency of behaviors, or types of personality traits.
Example:
Which of the following best describes your level of physical activity in a typical week?
a) Sedentary, b) Lightly active, c) Moderately active, d) Very active, e) Super active
Concerns:
May oversimplify complex behaviors by forcing respondents into fixed categories.
Categories must be mutually exclusive and collectively exhaustive.
Respondents may struggle to place themselves into a single category.
Visual Analogue Scale (VAS)
Respondents mark a point on a continuous line to indicate their level of agreement, intensity, or frequency.
Advantages:
Captures fine-grained data on intensity.
Highly flexible for various constructs.
Concerns:
More difficult to score manually.
May be less familiar to respondents.
Checklists
Adjective checklist with binary options: Endorse or not (tick or leave blank).
Commonly used in personality tests like the Minnesota Multiphasic Personality Inventory (MMPI).
Q-Sorts
Sort statements into piles from most to least descriptive.
Like a categorical scale for statements.
Most statements are piled in the middle, with extremes indicating interesting information about the person.
Example statements:
I have a wide range of interests.
I am productive, get things done.
I am easily irritated.
Open-Ended Items
Respondents provide a written response to a question or prompt.
Example: Describe a situation in which you felt particularly stressed.
Advantages:
Allows for detailed, qualitative responses.
Can capture complex or unexpected insights.
Concerns:
Subjective scoring.
Time-consuming to analyze.
Can be more demanding for respondents.
Fill-in-the-Blank Items
Respondents complete a sentence or phrase.
Example: I feel most anxious when __.
Advantages:
Combines structure with openness for response.
Useful for assessing recall or specific knowledge.
Concerns:
May require subjective interpretation in scoring.
Can lead to ambiguous answers.
How to Write Items
General suggestions for item writing (Frey et al., 2005):
Avoid "All of the Above" and "None of the Above" as answer options.
Ensure all answer options are plausible.
Use logical or varied order of answer options.
Cover important concepts and objectives.
Avoid negative wording and specific determiners (e.g., always, never).
Answer options should include only one correct answer and be grammatically consistent with the stem.
Answer options should be homogeneous and not longer than the stem.
Stems must be unambiguous and clearly state the problem.
Correct answer options should not be the longest.
Use appropriate vocabulary.
In fill-in-the-blank items, use a single blank at the end.
Items should be independent of each other.
In matching, there should be more answer options than stems.
All parts of an item should appear on the same page.
True-false items should have simple structure and be entirely true or entirely false.
There should be 3-5 answer options.
Answer options should not have repetitive wording.
Point value of items should be presented.
Stems and examples should not be directly from the textbook.
Item Analysis
Item analysis is a critical process in the development and evaluation of tests, questionnaires, and surveys.
It involves examining the performance of individual test items to ensure they effectively measure the intended constructs and contribute to the overall reliability and validity of the assessment tool.
Item analysis helps in refining and improving test items based on statistical data and psychometric properties.
Two Numerical indices in Item Analysis:
Item Difficulty
Item Discriminability
Procedures involve statistical measures to review and revise items, estimate potential test forms, and make judgments about item quality.
Item Difficulty
How hard is the item?
The proportion/percentage of people who got an item correct.
Range - from chance to ceiling.
Chance:
Yes/No: chance = 0.5
4-option MC: chance = 0.25
Ceiling: 1.0. The problem with item difficulty = 1.0 is that an item everyone answers correctly carries no information and cannot discriminate between test-takers.
Optimum Item Difficulty
Optimum item difficulty: ~halfway between chance and ceiling (1).
Optimum = (Chance + 1)/2
Example:
Multiple-choice with 2 options:
Chance = 0.5; Ceiling = 1.0
Optimum = (0.5 + 1)/2 = 0.75
Multiple-choice with 4 options:
Chance = 0.25; Ceiling = 1.0
Optimum = (0.25 + 1)/2 = 0.625
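Both quantities are straightforward to compute from a scored response matrix; a minimal Python sketch with toy data:

    import numpy as np

    def item_difficulty(responses):
        # responses: rows = people, columns = items; 1 = correct, 0 = incorrect
        return np.asarray(responses).mean(axis=0)

    def optimum_difficulty(n_options):
        # Halfway between chance (1/n) and ceiling (1.0)
        return (1.0 / n_options + 1.0) / 2.0

    responses = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [1, 0, 1]]
    print(item_difficulty(responses))  # [1.0, 0.5, 0.5]; the first item is at ceiling
    print(optimum_difficulty(4))       # 0.625 for four-option multiple choice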
Purpose of the test:
Discriminate high achievers: use many high difficulty items (e.g., specialist selection).
Discriminate low achievers: use many low difficulty items (e.g., for resource allocation).
Alleviate test anxiety: include some easy items.
Item Discriminability
Item difficulty considers the performance of every participant who completed the test.
What proportion of test-takers got the item correct?
What if we consider subgroups of the test participants, grouped on the basis of their overall level of performance on the test?
How did people who did poorly overall on the test do on item X?
What about people who did very well overall on the test?
"Who gets this item correct?"
Relationship between performance on a given item and on the test as a whole
“Who does well on this item?”
“If you do well on this item, do you do well on the test as a whole?”
Item discriminability can be evaluated in a variety of ways…
Extreme Group Method
Split group into thirds based on overall performance.
Group 1: top third of participants.
Group 3: bottom third of participants.
Group 2: remaining participants.
Discrimination Index:
Discrimination index = % correct in top third − % correct in bottom third (i.e., % correct in Group 1 − % correct in Group 3).
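A minimal Python sketch of the extreme-group method (toy data; the split is on total test score):

    import numpy as np

    def discrimination_index(item_correct, total_scores):
        # item_correct: 1/0 per person for one item; total_scores: overall test scores
        order = np.argsort(total_scores)   # low to high overall performance
        third = len(order) // 3
        bottom, top = order[:third], order[-third:]
        item_correct = np.asarray(item_correct)
        return item_correct[top].mean() - item_correct[bottom].mean()

    item = [1, 0, 1, 1, 0, 1, 1, 0, 1]
    totals = [55, 40, 80, 75, 35, 90, 60, 30, 85]
    print(discrimination_index(item, totals))  # 1.0: top third all pass, bottom third all fail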
Point Biserial Method
Correlation between performance on the item and performance on the total test.
r_pbis = ((X̄_correct − X̄_all) / S_all) × √(p / (1 − p)), where:
X̄_correct is the mean test score for those who got item X correct.
X̄_all is the mean test score for all persons.
S_all is the standard deviation of test scores for all persons.
p is the proportion of persons getting the item correct.
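A minimal Python sketch of the point-biserial calculation using the definitions above (toy data):

    import numpy as np

    def point_biserial(item_correct, total_scores):
        # r_pbis = ((mean_correct - mean_all) / sd_all) * sqrt(p / (1 - p))
        item_correct = np.asarray(item_correct, dtype=bool)
        total_scores = np.asarray(total_scores, dtype=float)
        p = item_correct.mean()                        # proportion getting the item correct
        mean_correct = total_scores[item_correct].mean()
        mean_all = total_scores.mean()
        sd_all = total_scores.std()                    # SD of all test scores
        return ((mean_correct - mean_all) / sd_all) * np.sqrt(p / (1 - p))

    item = [1, 0, 1, 1, 0, 1]
    totals = [70, 45, 80, 75, 50, 85]
    print(round(point_biserial(item, totals), 2))  # ~0.95: strongly discriminating item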
Evaluating Item Discriminability
If there is a negative or low correlation between a certain item and the overall score, eliminate it from the test: it is measuring something different from (or opposite to) what the rest of the test measures.
If close to 1: good item, discriminates between good and bad performers on the test.
Note that if nearly everybody gets a given item correct or incorrect there is not enough variability for there to be a substantial correlation (r_pbis) with total test score.
On a test with only a few items, this correlation method is problematic: an item's score is bound to correlate positively with a total score that includes it.
One solution is to correlate the item with the total score derived from the rest of the items (e.g., the correlation between item 1 and the sum of items 2-6).
Item Characteristic Curve
Curve showing the relationship between the probability of answering correctly and the ability level.
"Good" test item as opposed to "Bad" test item
Problematic test item
Flat ranges suggest low sensitivity
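A minimal sketch of item characteristic curves under a two-parameter logistic (2PL) model; the discrimination (a) and difficulty (b) values are made up for illustration:

    import numpy as np

    def icc_2pl(theta, a, b):
        # P(correct | ability theta); larger a = steeper curve = better discrimination
        return 1.0 / (1.0 + np.exp(-a * (theta - b)))

    abilities = np.linspace(-3, 3, 7)
    print(icc_2pl(abilities, a=2.0, b=0.0))  # "good" item: steep rise around its difficulty
    print(icc_2pl(abilities, a=0.3, b=0.0))  # "bad" item: flat curve, low sensitivity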
Difficulty vs. Discriminability
The sweet spot combines appropriate difficulty and high discriminability.
Item Response Theory (IRT)
Emphasis on responses to particular items versus performance across a complete sample of items as in Classical Test Theory.
Item Response Theory makes extensive use of item analysis techniques.
Steps of Developing A Test Using Item Analysis and Item Response Theory
Define construct
Generate Initial Item Pool
Pilot testing
~300 participants
Apply Item Analysis
Get item difficulty and item discriminability
Remove low quality items
Apply IRT
Select appropriate model
Make sure items fit the model well
Get Test Information Function to understand the test quality
Assumption: latent ability of test-taker is independent of the particular test.
Attempts to model the relationship between the probability of answering an item correctly (item difficulty) and the underlying ability of the test taker.
Assumes that it does not matter which items are used to estimate the test-taker's ability.
Thus, can compare results across examinees, even if exposed to different items.
The idea is to identify the probability of getting item X correct at different overall ability levels.
Each item has its own item characteristic curve (where x-axis is now ability level, not total test score).
Method of administration using IRT: (see week 2)
Get first pass idea of overall ability level
Spend time discriminating around that estimated level
Score defined by level of difficulty of items examinee can answer correctly, NOT by total number of correct items
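A minimal sketch of this adaptive idea, assuming a pre-calibrated item bank (the difficulties and the crude ability update are made up; real computerized adaptive tests use maximum-likelihood or Bayesian ability estimation):

    def pick_next_item(ability, bank):
        # Choose the unused item whose difficulty is closest to the current estimate
        return min(bank, key=lambda item: abs(item["difficulty"] - ability))

    bank = [{"id": i, "difficulty": d} for i, d in enumerate([-2.0, -1.0, 0.0, 1.0, 2.0])]
    ability = 0.0                          # first-pass estimate of overall ability
    for _ in range(3):
        item = pick_next_item(ability, bank)
        bank.remove(item)
        answered_correctly = True          # stand-in for the examinee's actual response
        ability += 0.5 if answered_correctly else -0.5  # crude step update, for illustration
        print(item["id"], round(ability, 1))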
Advantages:
Efficient testing: don’t waste time on too easy/hard items
Examinee motivation: don’t expose to repetitive failure
Reduced likelihood of cheating: each examinee sees different items
Problems/Limits:
Does not help students learn
Need very large pool of well characterized items
Requires computer administration
Test Administration
Various online survey platforms exist, such as Google Forms, SurveyMonkey, Qualtrics, and REDCap, each offering a variety of question/response types.
The way a test is administered can influence its result.
Test administration can be a source of error.
Administrator characteristics can influence the result.
Test-taker characteristics can influence the result.
E.g., confident versus high test-anxiety.
Test setting can influence the result:
E.g., quiet office versus at school assessment for a child with ADHD.
The Examiner-Examinee Relationship
Establish rapport to help the examinee.
Familiarity:
Children unfamiliar with the administrator did significantly worse on a reading test compared to children familiar with the administrator.
In general, a 4 IQ point increase when the examiner was familiar to the test-taker.
A 7.6 IQ point increase when familiarity and lower SES co-occurred.
Stereotype Threat
Most people worry how they will perform on testing.
May be worse for groups victimized by negative stereotypes.
Awareness of negative stereotype may inhibit performance.
With the instruction “this test captures male/female differences”, men score higher; with the instruction “this test does not capture male/female differences”, men and women score comparably.
Students engage better with Uni if they think intelligence is malleable.
Stereotype threat:
Depletes working memory
Encourages self-handicapping
Promotes physiological arousal
Arousal is helpful, but only up to a point.
Language of Test and Examinee
Linguistic demands can put ESL (English as a second language) and non-English speakers at a disadvantage.
Limited Vocabulary; Cultural Differences.
Even non-verbal tests generally require verbal instructions.
Tests in different languages.
Translation and back-translation.
Response scale – adjectives.
Reliability and validity of a given test are uncertain when:
The test is given to examinees whose first language is not the test language.
The test has been administered via a translation.
Examples of Non-verbal Tests
Raven’s Progressive Matrices
Wechsler Nonverbal Scale of Ability
The Goodenough-Harris draw-a-person test
Cultural Response Bias
The tendency for social factors to influence the way that people perceive and respond to survey questions.
Cultural response bias has been found to contribute to cross-cultural differences in subjective well-being (SWB) scores.
Expectancy Effects
Data affected by examiner’s expectations.
Confirmation bias: e.g. instruction to examiners shapes scoring (‘on average most will fail’ vs. ‘on average most will pass’).
Interpretation of ambiguous responses (e.g. ‘Similarities’).
Examiner’s perception of examinee:
Examiners may give credit if they like the examinee (i.e., non-objectivity).
“Expectancy” literature is controversial, though worth noting.
Reinforcing Responses
Reinforcement affects behavior.
Examinee-relevant incentives can improve test scores.
Reward can improve performance especially in children.
Can increase symptom endorsement.
So, most test manuals insist no feedback be given.
Test manuals should give clear instructions, ensuring standardization (i.e. if feedback is given it is of a fixed type for all examinees).
Departure from these can violate reliability & validity.
Training
Administration and scoring errors are a large source of bias.
Typical graduate training: 2-4 administrations of a test (in class).
Importance of fieldwork placements (increasing experience with administration).
The majority of testing practice is obtained in fieldwork placements.
Error rates on WAIS administrations do not decrease until ~10 administrations have been completed!
Computer Administration
Advantages:
Excellent standardization.
Individually tailored sequence administration.
Precision of timing of responses.
Frees up the examiner for other duties (reducing cost).
Patience (computer doesn’t get bored, examinee can’t be rushed by a restless examiner).
Controls bias.
Examinees are more likely to reveal ‘undesirable’ info (e.g., substance use; acknowledging that therapy is not helping).
Subject Variables
The state of the respondent can also be a major source of error when administering a test.
Illness
Insomnia
Test anxiety (e.g., stressed students; a GP with a memory complaint)