Psychological Assessment - Creation and Administration of Psychological Tests

Psychological Test Writing and Evaluation

  • Evaluating a psychological test involves considering theoretical orientation, practical considerations, standardization, reliability, and validity.
    • Theoretical Orientation: Understanding the construct the test measures and ensuring items align with the theoretical description.
    • Practical Considerations: Assessing the reading level required and the test's length.
    • Standardization: Ensuring the tested population is similar to the standardization sample, the sample size is adequate, subgroup norms are established, and instructions allow for standardized administration.
    • Reliability: Checking if reliability estimates are sufficiently high (generally around 0.90 for clinical decision making and around 0.70 for research purposes) and considering the trait's stability, method of estimating reliability, and test format implications.
    • Validity: Examining the criteria and procedures used to validate the test and ensuring accurate measurements for the intended context and purpose.

Guidelines for Writing Psychological Test Items

  • Clearly define what you want to measure.
  • Be specific in item construction.
  • Be informed by relevant theory.
  • Generate a broad initial item pool.
  • Avoid redundancy in items.
  • Start with broad items and refine them.
  • Avoid creating items that are too lengthy.
  • Avoid items that are likely to be confusing or misleading.
  • Write for the appropriate reading level considering the age and background of the target population.
  • Ensure each item deals with only one concept at a time.
    • Avoid double-barreled questions like “I vote Labor because I support social programs” (a respondent may agree with one part but not the other).
  • Incorporate a mix of positively and negatively worded questions.
    • Examples: “I tend to be a happy person” vs. “I often feel sad.”
  • Choose an appropriate response format.
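Mixing positively and negatively worded items implies reverse-scoring at analysis time so all items point in the same direction. A minimal sketch in Python, assuming a 5-point scale; the item wording and responses are hypothetical:

```python
# Sketch: reverse-scoring negatively worded Likert items before summing.
# Assumes a 5-point scale (1-5); items and responses are hypothetical.

def reverse_score(response: int, scale_max: int = 5, scale_min: int = 1) -> int:
    """Flip a response so 'I often feel sad' aligns with 'I tend to be happy'."""
    return scale_max + scale_min - response

# One respondent's answers: (item, response, negatively_worded?)
responses = [
    ("I tend to be a happy person", 4, False),
    ("I often feel sad", 2, True),  # reverse-scored to 4
]

total = sum(reverse_score(r) if neg else r for _, r, neg in responses)
print(total)  # 8: both items now point in the 'happiness' direction
```

Without reverse-scoring, the two items would partly cancel each other out and deflate the scale total.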

Item Formats

Dichotomous

  • True/False or Yes/No questions.
  • Advantages:
    • Simplicity and ease of administration and marking.
    • Suitable for dichotomous constructs.
  • Disadvantages:
    • Doesn’t accommodate shades of grey or subtlety.
    • Can be passed without understanding (rote memorization).
    • Can lead to guessing (50% chance correct), requiring many items for reliability.

Polytomous

  • Multiple-Choice Items:
    • Respondents select one answer from several options, including one correct answer and several distractors.
    • Example: "Which neurotransmitter is most commonly associated with mood and anxiety regulation?"
      • Options: Dopamine, Serotonin, GABA, Acetylcholine
      • Dopamine - distractor; central to reward and pleasure systems
      • Serotonin - correct answer; the neurotransmitter most commonly linked to mood and anxiety regulation
      • GABA - distractor; primary inhibitory neurotransmitter that reduces anxiety and promotes relaxation
      • Acetylcholine - distractor; critical role in the autonomic nervous system, influencing functions such as heart rate and digestion
  • Advantages:
    • Easy and objective scoring.
    • Lower chance performance improves reliability.
    • Easy to analyze statistically.
  • Disadvantages:
    • Difficult to produce good distractors.
    • Reliability only improves with good distractors.
    • Ideally, produce 2-4 good distractors.
    • Evidence suggests 3-alternative multiple choice can be as good as 4+ alternatives.
    • May not capture the depth of knowledge or feelings.
Guessing Issue
  • Guessing chance $= \frac{1}{n}$, where $n$ = number of choices available.
    • Example: 3-option MCQ has ~33% chance level.
    • Example: 4-option MCQ has 25% chance level.
  • Correction formula for guessing in multiple-choice tests:
    • $Score = R - \frac{W}{n-1}$, where $R$ = correct responses, $W$ = wrong responses, $n$ = number of choices.
  • Example Calculation:
    • 80 correct out of 100 4-option multiple-choice questions.
    • $Score = 80 - \frac{20}{4-1} = 73.33$
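The correction-for-guessing formula and the worked example above can be sketched in Python; the function name is my own:

```python
# Sketch of the correction-for-guessing formula from the notes:
# Score = R - W / (n - 1)

def corrected_score(correct: int, wrong: int, n_options: int) -> float:
    """Penalize wrong answers by the expected gain from random guessing."""
    return correct - wrong / (n_options - 1)

# Worked example from the notes: 80 correct out of 100 four-option questions.
print(round(corrected_score(80, 20, 4), 2))  # 73.33
```

Note that for a two-option (true/false) test, every wrong answer cancels a correct one, so pure guessing scores zero on average.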

Likert Format

  • Measures degree of agreement or satisfaction, typically used to measure attitude.
  • Various options for the number of points on a Likert scale (e.g., 5-point vs. 7-point vs. 10-point).
    • Example: “Tom Cruise alone can save the world”: strongly disagree, disagree, neutral, agree, strongly agree.
  • Results can be subjected to factor analysis, which is often considered an advantage (e.g., in intelligence assessment).
  • Factor analysis works because multiple items show similar response patterns when they are associated with the same latent variable(s).
  • Number of Points:
    • A scale with more points allows for finer discrimination between responses, capturing nuanced variations in attitudes or opinions (e.g., 7 points vs. 5 points).
  • Interpretability:
    • More points on a Likert scale facilitate clearer interpretation of respondents' attitudes or perceptions. With a greater number of response options, it becomes easier to distinguish between different levels of agreement or disagreement, providing richer insights into respondents' opinions.
  • Psychometric Properties:
    • A 7-point scale might provide higher internal consistency and greater discriminant validity compared to a 5-point scale, especially for constructs with more variability or complexity.
  • Anchors on a Likert scale are critical.
    • The reference points that respondents use to interpret and assign meaning to their responses.
    • Unipolar vs. Bipolar Scales:
      • Unipolar scales indicate the presence or absence of a quality or trait (0, 1, 2, 3, 4).
      • Bipolar scales balance two different qualities (-2, -1, 0, 1, 2).

Category Format

  • Respondents choose a category that best describes their response or behavior.
  • Ideal for measuring constructs that naturally fall into distinct categories, such as levels of physical activity, frequency of behaviors, or types of personality traits.
  • Example:
    • Which of the following best describes your level of physical activity in a typical week?
      • a) Sedentary, b) Lightly active, c) Moderately active, d) Very active, e) Super active
  • Concerns:
    • May oversimplify complex behaviors by forcing respondents into fixed categories.
    • Categories must be mutually exclusive and collectively exhaustive.
    • Respondents may struggle to place themselves into a single category.

Visual Analogue Scale (VAS)

  • Respondents mark a point on a continuous line to indicate their level of agreement, intensity, or frequency.
  • Advantages:
    • Captures fine-grained data on intensity.
    • Highly flexible for various constructs.
  • Concerns:
    • More difficult to score manually.
    • May be less familiar to respondents.

Checklists

  • Adjective checklist with binary options: Endorse or not (tick or leave blank).
  • Commonly used in personality tests like the Minnesota Multiphasic Personality Inventory (MMPI).

Q-Sorts

  • Sort statements into piles from most to least descriptive.
  • Like a categorical scale for statements.
  • Most statements are piled in the middle, with extremes indicating interesting information about the person.
  • Example statements:
    • I have a wide range of interests.
    • I am productive, get things done.
    • I am easily irritated.

Open-Ended Items

  • Respondents provide a written response to a question or prompt.
  • Example: Describe a situation in which you felt particularly stressed.
  • Advantages:
    • Allows for detailed, qualitative responses.
    • Can capture complex or unexpected insights.
  • Concerns:
    • Subjective scoring.
    • Time-consuming to analyze.
    • Can be more demanding for respondents.

Fill-in-the-Blank Items

  • Respondents complete a sentence or phrase.
  • Example: I feel most anxious when __.
  • Advantages:
    • Combines structure with openness for response.
    • Useful for assessing recall or specific knowledge.
  • Concerns:
    • May require subjective interpretation in scoring.
    • Can lead to ambiguous answers.

How to Write Items

  • General suggestions for item writing (Frey et al., 2005):
    • Avoid "All of the Above" and "None of the Above" as answer options.
    • Ensure all answer options are plausible.
    • Use logical or varied order of answer options.
    • Cover important concepts and objectives.
    • Avoid negative wording and specific determiners (e.g., always, never).
    • Answer options should include only one correct answer and be grammatically consistent with the stem.
    • Answer options should be homogenous and not longer than the stem.
    • Stems must be unambiguous and clearly state the problem.
    • Correct answer options should not be the longest.
    • Use appropriate vocabulary.
    • In fill-in-the-blank items, use a single blank at the end.
    • Items should be independent of each other.
    • In matching, there should be more answer options than stems.
    • All parts of an item should appear on the same page.
    • True-false items should have simple structure and be entirely true or entirely false.
    • There should be 3-5 answer options.
    • Answer options should not have repetitive wording.
    • Point value of items should be presented.
    • Stems and examples should not be directly from the textbook.

Item Analysis

  • Item analysis is a critical process in the development and evaluation of tests, questionnaires, and surveys.
  • It involves examining the performance of individual test items to ensure they effectively measure the intended constructs and contribute to the overall reliability and validity of the assessment tool.
  • Item analysis helps in refining and improving test items based on statistical data and psychometric properties.
  • Two Numerical indices in Item Analysis:
    • Item Difficulty
    • Item Discriminability
  • Procedures involve statistical measures to review and revise items, estimate potential test forms, and make judgments about item quality.

Item Difficulty

  • How hard is the item?
  • The proportion/percentage of people who got an item correct.
  • Range - from chance to ceiling.
  • Chance:
    • Yes/No: chance = 0.5
    • 4-option MC: chance = 0.25
  • Ceiling: 1.0. The problem with item difficulty = 1.0: everyone passes the item, so it carries no information and cannot discriminate between test-takers.
Optimum Item Difficulty
  • Optimum item difficulty: ~halfway between chance and ceiling (1).
    • $Optimum = \frac{Chance + 1}{2}$
  • Example:
    • Multiple-choice with 2 options:
      • Chance = 0.5; Ceiling = 1.0
      • $Optimum = \frac{0.5 + 1}{2} = 0.75$
    • Multiple-choice with 4 options:
      • Chance = 0.25; Ceiling = 1.0
      • $Optimum = \frac{0.25 + 1}{2} = 0.625$
  • Purpose of the test:
    • Discriminate high achievers: use many high difficulty items (e.g., specialist selection).
    • Discriminate low achievers: use many low difficulty items (e.g., for resource allocation).
    • Alleviate test anxiety: include some easy items.
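The chance and optimum-difficulty values above follow directly from the formula; a quick Python sketch (function names are my own):

```python
# Sketch: chance level and optimum item difficulty (halfway between
# chance and the ceiling of 1.0) for an item with n answer options.

def chance_level(n_options: int) -> float:
    return 1 / n_options

def optimum_difficulty(n_options: int) -> float:
    return (chance_level(n_options) + 1) / 2

for n in (2, 4):
    print(n, chance_level(n), optimum_difficulty(n))
# 2 0.5 0.75
# 4 0.25 0.625
```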

Item Discriminability

  • Item difficulty considers the performance of every participant who completed the test: what proportion of test-takers got the item correct?
  • Item discriminability instead considers subgroups of the test participants, grouped on the basis of their overall level of performance on the test:
    • How did people who did poorly overall on the test do on item X?
    • What about people who did very well overall on the test?
  • The central question is “who gets this item correct?”: the relationship between performance on a given item and on the test as a whole.
    • “If you do well on this item, do you do well on the test as a whole?”
  • Item discriminability can be evaluated in a variety of ways…
Extreme Group Method
  • Split group into thirds based on overall performance.
    • Group 1: top third of participants.
    • Group 3: bottom third of participants.
    • Group 2: remaining participants.
  • Discrimination Index:
    • Discrimination index = (% correct in top third) − (% correct in bottom third).
    • That is, % of people correct in Group 1 − % of people correct in Group 3.
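The extreme group method can be sketched as follows; the function name and the item/score data are hypothetical:

```python
# Sketch of the extreme-group discrimination index: proportion correct
# in the top third minus proportion correct in the bottom third.

def discrimination_index(item_correct, total_scores):
    """item_correct: 1/0 per examinee; total_scores: overall test scores."""
    ranked = sorted(zip(total_scores, item_correct), reverse=True)
    third = len(ranked) // 3
    top = [c for _, c in ranked[:third]]        # Group 1: top third
    bottom = [c for _, c in ranked[-third:]]    # Group 3: bottom third
    return sum(top) / len(top) - sum(bottom) / len(bottom)

item = [1, 1, 1, 1, 0, 1, 0, 0, 0]              # who got this item right
totals = [95, 90, 88, 75, 70, 68, 50, 45, 40]   # overall test scores
print(discrimination_index(item, totals))  # 1.0: perfect discrimination
```

An index near 1 means the item separates strong from weak test-takers; near 0 (or negative) means it does not.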
Point Biserial Method
  • Correlation between performance on the item and performance on the total test.
    • $r_{pbis} = \frac{\bar{X}_{correct} - \bar{X}_{all}}{S_{all}} \sqrt{\frac{p}{1-p}}$
    • Where:
      • $\bar{X}_{correct}$ is the mean score on the test for those who got item X correct.
      • $\bar{X}_{all}$ is the mean score on the test for all persons.
      • $S_{all}$ is the standard deviation of the scores for all persons.
      • $p$ is the proportion of persons getting the item correct.
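A direct translation of the point-biserial formula into Python; this sketch assumes the population standard deviation, and the data are invented for illustration:

```python
# Sketch of the point-biserial item-total correlation:
# r_pbis = (mean_correct - mean_all) / sd_all * sqrt(p / (1 - p))

import math

def point_biserial(item_correct, total_scores):
    n = len(total_scores)
    mean_all = sum(total_scores) / n
    # Population SD, matching the formula's S_all (an assumption here).
    sd_all = math.sqrt(sum((x - mean_all) ** 2 for x in total_scores) / n)
    correct_scores = [t for c, t in zip(item_correct, total_scores) if c == 1]
    p = len(correct_scores) / n
    mean_correct = sum(correct_scores) / len(correct_scores)
    return (mean_correct - mean_all) / sd_all * math.sqrt(p / (1 - p))

item = [1, 1, 1, 0, 0, 0]            # item responses (1 = correct)
totals = [90, 85, 80, 60, 55, 50]    # total test scores
print(round(point_biserial(item, totals), 3))  # 0.965
```

Here the three highest scorers all pass the item and the three lowest all fail it, so the correlation is close to 1.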
Evaluating Item Discriminability
  • If there is a negative or low correlation between a certain item and overall score, eliminate it from the test: a negative correlation means weaker test-takers outperform stronger ones on that item, so it is not measuring the same construct.
  • If close to 1: good item; it discriminates between good and bad performers on the test.
  • Note that if nearly everybody gets a given item correct or incorrect, there is not enough variability for a substantial correlation ($r_{pbis}$) with total test score.
  • On a test with only a few items, this correlation method is problematic: each item contributes substantially to the total score, so a positive correlation between an item and the total is almost guaranteed.
  • One solution is to calculate the correlation between the item and the total score derived from the rest of the items (e.g., the correlation between item 1 and the sum of items 2-6).
Item Characteristic Curve
  • Curve showing the relationship between the probability of answering an item correctly and the test-taker's ability level.
  • A "good" item shows a steep, monotonic rise: the probability of success climbs with ability. A "bad" or problematic item is flat or non-monotonic.
  • Flat ranges of the curve suggest low sensitivity: the item does not discriminate among test-takers in that ability range.
Difficulty vs. Discriminability
  • The sweet spot combines appropriate difficulty and high discriminability.
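The notes do not give a formula for the item characteristic curve; a common choice is the two-parameter logistic (2PL) model from IRT, sketched here with illustrative parameter values:

```python
# Sketch: an item characteristic curve via the two-parameter logistic (2PL)
# model, P(correct | ability) = 1 / (1 + exp(-a * (theta - b))).
# a = discrimination (slope), b = difficulty; values are illustrative.

import math

def icc(theta: float, a: float, b: float) -> float:
    return 1 / (1 + math.exp(-a * (theta - b)))

# A discriminating item (steep slope) vs. a flat, insensitive one.
for theta in (-2, -1, 0, 1, 2):
    good = icc(theta, a=2.0, b=0.0)   # steep around its difficulty level
    flat = icc(theta, a=0.3, b=0.0)   # flat: low sensitivity everywhere
    print(f"{theta:+d}  good={good:.2f}  flat={flat:.2f}")
```

The "good" item swings from near 0 to near 1 across the ability range, while the "flat" item hovers near 0.5 everywhere, which is the low-sensitivity pattern described above.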

Item Response Theory (IRT)

  • Emphasis on responses to particular items versus performance across a complete sample of items as in Classical Test Theory.
  • Item Response Theory makes extensive use of item analysis techniques.
  • Steps of Developing A Test Using Item Analysis and Item Response Theory
    • Define construct
    • Generate Initial Item Pool
    • Pilot testing
      • ~300 participants
    • Apply Item Analysis
      • Get item difficulty and item discriminability
      • Remove low quality items
    • Apply IRT
      • Select appropriate model
      • Make sure items fit the model well
      • Get Test Information Function to understand the test quality
  • Assumption: latent ability of test-taker is independent of the particular test.
  • Attempts to model the relationship between the probability of answering an item correctly (item difficulty) and the underlying ability of the test taker.
  • Assumes that it does not matter which items are used to estimate the test-taker's ability.
  • Thus, can compare results across examinees, even if exposed to different items.
  • Idea is to identify the probability of getting item X wrong at different overall ability levels.
  • Each item has its own item characteristic curve (where x-axis is now ability level, not total test score).
  • Method of administration using IRT: (see week 2)
    • Get first pass idea of overall ability level
    • Spend time discriminating around that estimated level
    • Score defined by level of difficulty of items examinee can answer correctly, NOT by total number of correct items
  • Advantages:
    • Efficient testing: don’t waste time on too easy/hard items
    • Examinee motivation: don’t expose to repetitive failure
    • Reduced likelihood of cheating: each examinee sees different items
  • Problems/Limits:
    • Do not help the students learn
    • Need very large pool of well characterized items
    • Requires computer administration
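The adaptive administration loop described above might be sketched as follows. This is a toy heuristic with a fixed step-size ability update, not a real IRT-based CAT algorithm; the item bank and simulated answers are hypothetical:

```python
# Sketch of adaptive administration: after each response, nudge the ability
# estimate and pick the next item whose difficulty is closest to it.

def next_item(bank, theta, used):
    """Pick the unused item whose difficulty is closest to current ability."""
    candidates = [(abs(b - theta), i) for i, b in enumerate(bank) if i not in used]
    return min(candidates)[1]

def adaptive_test(bank, answers, theta=0.0, step=0.5):
    used = set()
    for _ in range(len(answers)):
        i = next_item(bank, theta, used)
        used.add(i)
        theta += step if answers[i] else -step  # up if correct, down if wrong
    return theta

bank = [-2.0, -1.0, 0.0, 1.0, 2.0]                        # item difficulties
answers = {0: True, 1: True, 2: True, 3: True, 4: False}  # simulated examinee
print(adaptive_test(bank, answers))  # 1.5
```

Note how the final score reflects the difficulty level the examinee can handle, not the raw number of correct answers, which is the key idea of IRT-based scoring.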

Test Administration

  • Various online survey platforms exist, such as Google Forms, SurveyMonkey, Qualtrics, and REDCap, each offering a variety of question/response types.
  • The way a test is administered can influence its result.
  • Test administration can be a source of error.
  • Administrator characteristics can influence the result.
  • Test-taker characteristics can influence the result.
    • E.g., confident versus high test-anxiety.
  • Test setting can influence the result:
    • E.g., quiet office versus at school assessment for a child with ADHD.

The Examiner-Examinee Relationship

  • Establish rapport to help the examinee.
  • Familiarity:
    • Children unfamiliar with the administrator did significantly worse on a reading test compared to children familiar with the administrator.
    • In general, a 4 IQ point increase when the examiner was familiar with the test-taker.
    • A 7.6 IQ point increase when familiarity and lower SES co-occurred.

Stereotype Threat

  • Most people worry how they will perform on testing.
  • May be worse for groups victimized by negative stereotypes.
  • Awareness of negative stereotype may inhibit performance.
  • Instruction “this test captures male/female differences”: men score higher, vs. instruction “this test does not capture male/female differences”: men and women score comparably.
  • Students engage better with university if they think intelligence is malleable.
  • Stereotype threat:
    • Depletes working memory
    • Self-handicapping
    • Promotes physiological arousal
      • Arousal good, but only to a point.

Language of Test and Examinee

  • Linguistic demands can put ESL (English as a second language) and non-English speakers at a disadvantage.
  • Limited Vocabulary; Cultural Differences.
  • Even non-verbal tests generally require verbal instructions.
  • Tests in different languages.
  • Translation and back-translation.
  • Response scale – adjectives.
  • Reliability and validity of a given test are uncertain when:
    • Given to examinees whose 1st language is not the test language.
    • The test has been administered via a translation.
Examples of Non-verbal Test
  • Raven’s Progressive Matrices
  • Wechsler Nonverbal Scale of Ability
  • The Goodenough-Harris draw-a-person test

Cultural Response Bias

  • Tendency for social factors to influence the way that people perceive and respond to survey questions.
  • Cultural response bias has been verified as contributing to the subjective well-being (SWB) difference between cultures.

Expectancy Effects

  • Data affected by examiner’s expectations.
  • Confirmation bias: e.g. instruction to examiners shapes scoring (‘on average most will fail’ vs. ‘on average most will pass’).
  • Interpretation of ambiguous responses (e.g. ‘Similarities’).
  • Examiner’s perception of examinee:
    • Give credit if like the examinee (i.e. non-objectivity).
  • “Expectancy” literature is controversial, though worth noting.

Reinforcing Responses

  • Reinforcement affects behavior.
  • Examinee-relevant incentives can improve test scores.
  • Reward can improve performance especially in children.
  • Can increase symptom endorsement.
  • So, most test manuals insist no feedback be given.
  • Test manuals should give clear instructions, ensuring standardization (i.e. if feedback is given it is of a fixed type for all examinees).
  • Departure from these can violate reliability & validity.

Training

  • Administration and scoring errors are a large source of bias.
  • Typical graduate training: 2-4 administrations of a test (in class).
  • Importance of fieldwork placements (increasing experience with administration).
  • The majority of testing practice is obtained in fieldwork placements.
  • Error rates on WAIS administrations do not decrease until ~10 administrations have been completed!

Computer Administration

  • Advantages:
    • Excellent standardization.
    • Individually tailored sequence administration.
    • Precision of timing of responses.
    • Frees up the examiner for other duties (reducing cost).
    • Patience (computer doesn’t get bored, examinee can’t be rushed by a restless examiner).
    • Controls bias.
    • Examinees are more likely to reveal ‘undesirable’ info (e.g., substance use; acknowledging that therapy is not helping).

Subject Variables

  • The state of the respondent can also be a major source of error when administering a test.
    • Illness
    • Insomnia
    • Test-anxiety (examples of stressed students, GP with memory complaint)
    • Drugs (prescription and recreational – caffeine!)