Psychological Measurement & Testing - Exam 2


152 Terms

1

Absolute Scores as a test score transformation

Raw test scores are transformed to be easily compared to a fixed standard.

2

What is item bias?

Items being more difficult for one group of people than for another for reasons that have nothing to do with the construct.

  • E.g., males performing better on sports questions than females

3

What is response bias?

  • Individual differences in response patterns that have nothing to do with the content of the test.

    • E.g., using the extremes or the middle.

  • Introduce (additional) error.

4

Examples of response bias

  • Acquiescence = tendency to agree with all items

  • Random responding

  • Social desirability = people want to present themselves in a positive light

5

What is acquiescence, and what can we do about it?

  • The tendency to agree with all items.

    • What to do: balance positive and negative items.

6

What are validity scales?

Special items in the test to detect which test takers are giving dishonest answers

7

Why do test users calculate standard scores?

Scores are compared to the distribution of scores for some particular reference group.

  • Interpreted relative to the group norms – above average, below average, percentiles, etc

  • Provide more meaning to individual scores and so that we can compare individual scores with those of a previously tested group or norm group.

8

What are linear transformations?

Transform raw scores so that they have a particular mean and standard deviation.

  • Adding, subtracting, multiplying, and dividing scores by some set of constant values.

  • The shape of the distribution stays the same.

  • Gives context and makes it easier to interpret a single score

9

Examples of linear transformations

z-Scores

  • Subtract the mean from the observed score.

  • Divide by the standard deviation.

  • Scores now have a mean of 0, SD of 1

T-Scores

  • Multiply each z-score by 10 and add 50 (mean = 50, SD = 10).

  • Now all the scores are positive, but we still have exactly the same information we had in our z-scores!
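The z- and T-score recipes in this card can be sketched in a few lines of Python (the raw scores are hypothetical):

```python
# Sketch: transforming hypothetical raw scores into z-scores and T-scores.
raw = [70, 80, 90, 100, 110]

n = len(raw)
mean = sum(raw) / n
# Population SD (divide by n), treating the group as the whole reference set.
sd = (sum((x - mean) ** 2 for x in raw) / n) ** 0.5

z_scores = [(x - mean) / sd for x in raw]   # mean 0, SD 1
t_scores = [50 + 10 * z for z in z_scores]  # mean 50, SD 10 - all positive

print([round(z, 2) for z in z_scores])
print([round(t, 2) for t in t_scores])
```

Both are linear transformations, so the shape of the distribution (and the rank order of test takers) is unchanged.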

10

How to interpret z-scores

Helps us to understand how many standard deviations an individual test score is above or below the distribution mean

  • The mean of a distribution of test scores = z score of 0.

  • A z score of 1 = 1 standard deviation above the mean.

  • A z score of −1 = 1 standard deviation below the mean.

11

How to interpret T-scores

Similar to z scores in that they help us understand how many standard deviations an individual test score is above or below the distribution mean.

  • However, they always have a mean of 50 and a standard deviation of 10.

  • They are also always positive, unlike z scores.

  • T-score of 60 = 1 standard deviation above the mean.

  • T-score of 30 = 2 standard deviations below the mean.

12

Why use linear transformations?

  • Help us communicate about what a test score means in relative terms.

  • Help us compare across tests that have very different raw scores

13

What are area transformations?

Now we are changing the shape of the distribution.

  • More complicated mathematically or procedurally – doing things other than 3rd grade math.

  • Percentiles, stanines

14

What are percentiles?

  • % of people who scored below the person we are interested in

    • All the people who scored below + half of the people who scored exactly the same.
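The percentile-rank rule above (everyone below plus half of the ties) can be sketched in Python with hypothetical scores:

```python
# Sketch: percentile rank = % scoring below + half of those scoring exactly the same.
def percentile_rank(scores, score):
    below = sum(1 for s in scores if s < score)
    same = sum(1 for s in scores if s == score)
    return 100 * (below + 0.5 * same) / len(scores)

scores = [60, 70, 80, 70, 90]          # hypothetical norm-group scores
print(percentile_rank(scores, 70))     # 1 below + half of 2 ties = 2 of 5 -> 40.0
```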

15

What are stanines?

A standard score scale with nine points that allows us to describe a distribution in words instead of numbers (from 1 = very poor to 9 = very superior)

  • Simplified way to express percentile information

  • Differentiates more between people in the middle than T-scores
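Assuming the conventional stanine bands (4-7-12-17-20-17-12-7-4 percent of the distribution, i.e., cumulative cutoffs of 4, 11, 23, 40, 60, 77, 89, 96 - an assumption, since the card does not give the cutoffs), a percentile rank maps to a stanine like this:

```python
# Sketch: mapping a percentile rank to a stanine.
# Cutoffs assume the standard 4-7-12-17-20-17-12-7-4 percent bands.
def stanine(percentile):
    cutoffs = [4, 11, 23, 40, 60, 77, 89, 96]  # cumulative upper bounds, stanines 1-8
    for i, cut in enumerate(cutoffs, start=1):
        if percentile <= cut:
            return i
    return 9

print(stanine(50))   # middle of the distribution -> 5
```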

16

Normative approach to scoring

The scores will be used to compare test takers with other test takers

  • E.g., an employment test in which the applicant who achieves the highest score will receive the job offer

17

Criterion approach to scoring

The scores will be used to indicate achievement

  • Must achieve a certain score to qualify as passing or excellent

  • E.g., student performance using the letter grades A to F.

18

What are norms?

The average scores of some predefined group we want to compare to.

  • Given to a clearly defined norm group.

    • E.g., Colorado third-graders

19

Where do norms come from?

  • Who does your norm group need to represent?

    • All Colorado 3rd-graders?

    • Denver district 3rd-graders?

    • All 3rd-graders nationally?

20

Why is it so important to pay attention to the norm group when interpreting a test score?

Norms apply only to this group!

  • If you send the test off for scoring and get back percentiles, stanines, t-scores, etc., those scores are probably based on the norms (they had to get a mean and standard deviation from somewhere!)

  • Interpreting those scores then requires understanding of the norm group.

21

Where would you find information about the norms for a published test?

In the test manual

22

What does it mean to have a representative sample, and why is it important for norming?

  • Representativeness of the norm group is more or less important depending on who and what you are using the test for.

    • High-stakes testing, diverse populations, etc. call for higher levels of representativeness.

    • Representative norm groups require careful attention to sampling (ex: ALL Colorado 3rd graders?)

23

What is measurement error?

ALL measurement (even physical measurement) contains some error.

  • We can’t eliminate all error in psychological testing, but we can reduce it and/or account for it when we use tests.

  • In order to do this, though, we need to know what kind of error we’re dealing with and how much.

24

Where does measurement error come from?

  • Test construction

    • item choice, item wording, etc.

  • Test administration

    • temperature, time, lighting, administration errors, etc.

  • Test-taker variables

    • test anxiety, amount of sleep, hunger, distraction, etc.

  • Scoring and interpretation

    • Differences among scorers – training, motivation, attention, etc.

25

What is the meaning of the classical test theory equation “X = T + E”?

Test score (X) = true score (T) + error (E)

26

If we know X, what do we need in order to find T and E?

A reliability coefficient = (estimated) proportion of test score variance that is due to true score variance.

27

What is a reliability coefficient, in mathematical terms?

(estimated) proportion of test score variance that is due to true score variance.

  • the ratio of true variance to total variance.
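The ratio can be illustrated with a small simulation under the X = T + E model (the SDs below are hypothetical):

```python
import random

# Sketch: reliability as true-score variance / observed-score variance.
random.seed(0)

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Hypothetical true scores (SD 10) plus random error (SD 5):
# expected reliability = 100 / (100 + 25) = .80.
true = [random.gauss(50, 10) for _ in range(10_000)]
error = [random.gauss(0, 5) for _ in range(10_000)]
observed = [t + e for t, e in zip(true, error)]   # X = T + E

reliability = variance(true) / variance(observed)
print(round(reliability, 2))   # close to .80
```

In practice we never observe T directly, which is why reliability must be estimated (test-retest, alternate forms, internal consistency, etc.).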

28

What are the four main approaches to testing reliability?

  • Test-retest

  • Alternate forms

  • Internal consistency

  • Scorer reliability

29

What kind of error is considered in alternate forms reliability?

Create two alternate forms of the same test.

  • Would scores be different if we had used a different version of the test? (Test construction)

30

What are parallel forms?

  • Tests are parallel when they have equal means, variances, and reliability.

    • Scores on different forms are interchangeable.

    • Strict requirement – not all alternate forms are parallel forms.

31

What kind of error is considered in internal consistency reliability?

  • Goal: separate true score from error caused by idiosyncrasies in the questions. (Test construction)

32

What does internal consistency reliability assume about your construct?

Assumes all of the items are measuring one homogeneous construct.

  • High internal consistency is not evidence that all of your items measure the same thing – that would be circular!

  • For heterogeneous (aka multidimensional!) tests – need to estimate reliability separately for each component.

33

You find that the internal consistency reliability of your new test, as measured by Cronbach’s Alpha, is only 0.60. Which of the following is the most likely explanation for this?

Your items are not homogeneous – they measure more than one thing.

34

What is split-half reliability?

  • Divide your test into two halves!

    • Odd vs. even items

    • Randomly

    • Matching items to create two mini alternate forms.

  • Score each half separately and correlate the two scores.

  • Tells you how well the two sets of items go together.

  • But… this is only the reliability for half the test!

35

What is the Spearman-Brown prophecy formula?

  • Estimates what the reliability of your test would be if you had more items.

    • This is reasonable in split-half reliability because all of the items came from the same original test.

  • Formula:

    • (n × reliability) / (1 + (n − 1) × reliability)

n = number of items in the new version / items in the original.
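As a sketch, the formula translates directly into Python:

```python
# Sketch: Spearman-Brown prophecy formula.
def spearman_brown(reliability, n):
    """n = (number of items in the new version) / (items in the original)."""
    return n * reliability / (1 + (n - 1) * reliability)

# Doubling a test (n = 2) whose half-test reliability is .60:
print(round(spearman_brown(0.60, 2), 2))   # -> 0.75
```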

36

Why would you use the Spearman-Brown prophecy formula?

To estimate the number of questions to add to a test so as to increase its reliability to the desired level.

37

What kind of error is considered in scorer reliability?

Goal: separate true score from error caused by differences in raters.

  • Would scores be different if they came from a different rater? (Scoring and interpretation)

38

What kind of error is considered in test-retest reliability?

  • Goal: separate true score from error caused by temporary factors.

    • Mood, time of day, distractions, etc. (Test administration)

  • Would scores be different if we had measured at a different point in time?

39

What does test-retest reliability assume about your construct?

That the true score is stable.

  • This is not always a safe assumption!

40

Test-retest reliability: coefficient of stability

  • Give the same test to the same group of people at 2 different points in time.

  • Correlation between Time 1 & Time 2 scores = proportion of the variance due to true score.

    • What’s left over = proportion of the variance due to fluctuations over time.

41

What are the KR-20 and coefficient alpha formulas for?

They are indicators of internal consistency!

Estimate the average of all possible split-half correlations.

  • Based on all of the covariances among items.

  • So your reliability coefficient is not influenced by how you split the halves!
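A minimal sketch of coefficient alpha, using hypothetical rating-scale data (for dichotomous 0/1 items the same computation reduces to KR-20):

```python
# Sketch: Cronbach's alpha from per-item score lists (hypothetical data).
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(items):
    """items: one inner list per item, each holding every person's score."""
    k = len(items)
    totals = [sum(person) for person in zip(*items)]   # total score per person
    item_vars = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1 - item_vars / variance(totals))

# Three 5-point rating items answered by five people:
items = [[4, 5, 3, 2, 4],
         [4, 4, 3, 2, 5],
         [5, 5, 2, 3, 4]]
print(round(cronbach_alpha(items), 2))   # -> 0.89
```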

42

You read a test manual that reports Cronbach’s alpha as a measure of internal consistency for a test with dichotomous items. Why is this a concern?

Internal consistency among dichotomous items is best measured using KR-20.

43

KR-20

For dichotomous (right vs. wrong) items

44

Cronbach’s alpha

For rating-scale type items

45

Which is better, split-half reliability or coefficient alpha/KR-20?

Split-half method is a rough estimate.

KR-20 and Cronbach’s alpha are better!

46

Interrater Reliability

  • Goal: separate true score from error caused by differences in raters.

    • Would scores be different if they came from a different rater?

    • This is only relevant if we actually have more than one rater

47

Interrater Reliability ≠ Agreement

  • An interrater reliability correlation tells us that both raters put people in the same rank order – not that they gave the same scores.

  • Reliability may be enough if we’re just doing research (correlating ratings with other variables).

  • But if we are making decisions with these ratings, we really need agreement.

48

How do you calculate interscorer/interrater reliability?

  • Percent agreement

    • How often do raters give the same score?

  • Cohen’s kappa

    • How similar are the ratings of two different scorers?

  • Intrascorer agreement

    • How consistent are one rater’s ratings?
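Percent agreement and Cohen’s kappa can be sketched as follows (the two raters’ pass/fail ratings are hypothetical):

```python
# Sketch: percent agreement and Cohen's kappa for two raters.
def percent_agreement(r1, r2):
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    n = len(r1)
    po = percent_agreement(r1, r2)   # observed agreement
    cats = set(r1) | set(r2)
    # Chance agreement: product of each rater's marginal proportions, per category.
    pe = sum((r1.count(c) / n) * (r2.count(c) / n) for c in cats)
    return (po - pe) / (1 - pe)

r1 = ["pass", "pass", "fail", "pass", "fail", "pass"]
r2 = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(percent_agreement(r1, r2), 2))   # -> 0.83
print(round(cohens_kappa(r1, r2), 2))        # -> 0.67 (agreement beyond chance)
```

Kappa corrects percent agreement for the agreement two raters would reach by chance alone.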

49

What statistics can you use to calculate interscorer agreement?

  • KR-20 or alpha

    • Treat multiple scores from one rater just like you would treat multiple items from one test.

50

What is intrascorer reliability?

Whether each scorer was consistent in the way he or she assigned scores from test to test.

51

How do you decide which kind of reliability to use?

Which kind(s) of error are you concerned about?

  • Test-retest: temporary situational factors.

  • Alternate forms: differences between forms.

  • Internal consistency: quirks of the items.

  • Scorer: differences between scorers or raters.

52

How high should a reliability coefficient be to be “good enough”

  • Depends on your purpose!

    • For research: most people will accept above .70.

    • For making decisions about people: some people recommend above .90.

  • The higher the stakes, the higher your reliability should be.

53

How is reliability affected by the number of items in the test?

More items = higher reliability (unless the items really don’t fit)

54

How is reliability affected by the length of time that elapses between test and retest (for test-retest reliability)?

Measurements closer together will be more closely correlated.

55

How is reliability affected by restriction of range?

Little variance in a variable = restricted range = low correlation.

56

How is reliability affected by speed tests (rather than power tests)?

The test-taker usually doesn’t finish all of the items, but gets most, if not all, of the attempted items right.

  • Internal consistency isn’t appropriate here – we’re missing too much data on the later items.

57

What is the standard error of measurement (SEM)?

The standard deviation of an individual’s theoretical test score distribution.

58

How is the standard error of measurement similar to and different from the overall reliability of a test?

Reliability tells us what % of the variance in test scores is attributable to error.

  • Not the same thing as the % of any one test score that is attributable to error (SEM)

59

Standard error of measurement equation

SEM = SD × √(1 − r)
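A sketch of the formula, using a hypothetical SD of 15 and reliability of .91:

```python
# Sketch: standard error of measurement = SD * sqrt(1 - reliability).
def sem(sd, reliability):
    return sd * (1 - reliability) ** 0.5

print(round(sem(15, 0.91), 2))   # -> 4.5
```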

60

Why would you want to know the standard error of measurement?

To better understand the amount of error in a test score

61

How would you use the SEM to calculate a confidence interval around a person’s test score?

95% CI = X +/- 1.96(SEM)
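A sketch of the interval, assuming a hypothetical observed score of 100 and an SEM of 4.5:

```python
# Sketch: 95% confidence interval around an observed score X, using the SEM.
def ci95(x, sem):
    margin = 1.96 * sem   # 1.96 = z-value bounding the middle 95% of a normal curve
    return (x - margin, x + margin)

lo, hi = ci95(100, 4.5)
print(round(lo, 2), round(hi, 2))   # -> 91.18 108.82
```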

62

What is the standard error of the difference?

Tells us how different two scores need to be to be considered truly different.

63

The standard error of the difference equation

SE(diff) = SD × √(2 − reliability₁ − reliability₂)
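A sketch of the formula with a hypothetical SD of 10 and two test reliabilities of .90:

```python
# Sketch: standard error of the difference between two scores.
def se_diff(sd, rel1, rel2):
    return sd * (2 - rel1 - rel2) ** 0.5

print(round(se_diff(10, 0.90, 0.90), 2))   # -> 4.47
```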

64

When would you use the standard error of the difference?

Want to know if the observed difference is due to real change or to fluctuations in measurement error

65

Where does criterion-related validity evidence fit in the modern validity framework?

Connects to relationships with other variables validity

66

What is a criterion?

An important outcome or result of our construct.

  • Not the same as our construct – distinct, though we expect them to be related.

67

Examples of a criterion

job performance, graduate school success, successful completion of treatment, etc.

68

How do we find evidence of criterion-related validity?

  • Usually, correlate our test scores with the criterion (or criteria).

  • Two main strategies:

    • Predictive

    • Concurrent

69

Criterion-Related Validity Coefficient

Usually: correlation between test scores and criterion!

  • rxy

70

What is the coefficient of determination?

  • Square the correlation coefficient – rxy²

  • % of variance accounted for.
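For example, with a hypothetical validity coefficient of .50:

```python
# Sketch: coefficient of determination from a validity coefficient.
r_xy = 0.50            # hypothetical test-criterion correlation
r_squared = r_xy ** 2
print(r_squared)       # -> 0.25: the test accounts for 25% of criterion variance
```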

71

Is there a minimum “acceptable” value for a validity coefficient or coefficient of determination?

  • Remember that the significance of a correlation depends on the sample size.

    • So “significance” alone isn’t a good standard for determining validity, so long as you had a big enough sample to get a good estimate of the correlation.

  • Compare to other, similar measures – is your validity coefficient comparable?

72

How do you determine whether your validity coefficient is big enough?

  • Do you have a big enough sample to get a good estimate of the correlation?

  • Is your validity coefficient comparable to other, similar measures?

73

Does a statistically significant validity coefficient mean that your test is valid?

Statistical significance is very dependent on sample size, so it is typically not a proper way of evaluating most things in psychological measurement.

  • While it can be a good start, you want to cross-validate that validity coefficient by correlating your test/related outcomes with different samples.

74

Concurrent Validity

  • Compare test scores with the criterion at the same time.

  • Describes the present – does not tell you about the future.

75

Predictive Validity

  • Administer the test now, wait, then correlate test scores now with a criterion measured at some point in the future.

  • Shows that the test does predict future behavior

76

Pros of Concurrent Validity

  • Much faster!

  • In selection, less risky in the short term.

77

Cons of Concurrent Validity

  • Relationship may change over time.

  • Doesn’t really tell you about the future

78

Pros of Predictive Validity

  • Better information about predicting future outcomes.

79

Cons of Predictive Validity

  • Need to be patient!

  • May lose test-takers along the way

80

What is restriction of range?

Correlation between two variables is weakened when we don’t have much variability in one or both variables.

  • In other words, when our participants don’t cover the full possible range of the variable.

81

What effect does restriction of range have on your validity coefficient?

We will underestimate the validity of our test!

82

What is cross-validation?

  • One validity study does not guarantee your validity coefficient! So…

    • Test again with a different sample (or the other half of your original sample) – how similar is the validity coefficient?

83

Why would you do a meta-analysis of validity studies?

Any one study may contain error or situation-specific factors – we can be more confident in the result of a meta-analysis.

84

When would you want to use more than one predictor?

For complex outcomes with several contributing factors.

85

What do we call it when we use multiple tests together to predict an outcome?

Test batteries

86

When you use multiple predictors, why is the combined validity coefficient always less than the sum of the individual validity coefficients?

INCREMENTAL VALIDITY

  • Predictors share variance with one another, and shared variance does NOT add new explained variance.

  • As you add more tests, the shared variance goes up (not the unique variance each test contributes).

87

What is incremental validity?

The amount of additional variance in a test battery that can be accounted for in the criterion measure by the addition of one or more additional tests to the test battery.

88

What does incremental validity tell us about a predictor?

Predictors that are not highly correlated with one another do increase our validity coefficient.

89

Why do criteria need to be reliable and valid?

Because many of the outcomes we care about are complex (multidimensional). Must get at all aspects of the construct

90

What does it mean if we say a criterion is deficient?

Does not cover the whole outcome.

  • Common solution: use multiple or composite criteria.

91

What does it mean if we say a criterion is contaminated?

Includes things in addition to the outcome we care about.

  • E.g., sales performance and attractiveness.

92

How does having a deficient or contaminated criterion affect our validity study and the conclusions we can draw?

  • When unreliable or invalid, the true validity coefficient might be under or over estimated

  • Important to think about criterion as well as predictor

93

Criterion-related validity compared to reliability

  • Both rely on correlations.

    • Reliability correlates the test with itself.

    • Criterion-related validity correlates the test with an outcome.

  • Reliability is a necessary but not sufficient condition for criterion-related validity.

94

You are attempting to obtain criterion-related validity evidence for a test of high school students, and you have only one point in time to collect your data.

What is the best validation study design given these circumstances?

Concurrent Validity!

  • Compare test scores with the criterion at the same time. (present, not future).

95

Can you have a reliable test that is not valid?

Yes

96

Can you have a test that predicts well but is not reliable?

No

97

Criterion-related validity compared to appropriate content

  • If the test really measures what we think it measures, these should go together… but that’s not always the case.

    • We may write a well-designed, content valid test that doesn’t predict well.

    • We might also write a test that predicts well but doesn’t appear to be related to our construct.

  • You can have one and not the other

98

Is it possible that a test might predict an outcome well for one group of people but not another?

Yes!

  • Issues of culture, translation, etc. raise the possibility that a test might predict well for one group but not another.

  • Test users need to make sure the test is valid for ALL groups.

99

What is single-group validity?

A test predicts an outcome for one group but not at all for another.

  • Problematic… but this is very rare in practice.

100

What is differential validity?

More valid for one group than another.

  • Bigger issue than single-group validity
