Psychological Measurement & Testing - Exam 2


1
New cards
Absolute Scores as a test score transformation
Raw test scores are transformed to be easily *compared to a fixed standard.*
2
New cards
What is item bias?
Items being *more* difficult for one group of people than another for **reasons that have nothing to do with the construct.**

* E.g., males performing better on sports questions than females
3
New cards
What is response bias?
* Individual differences in response patterns that have nothing to do with the content of the test. 
* E.g., using the extremes or the middle. 
* Introduce (additional) error.
4
New cards
Examples of response bias
* __Acquiescence__ = tendency to agree with all items
* __Random responding__
* __Social desirability__ = people want to present themselves in a positive light
5
New cards
What is acquiescence, and what can we do about it?
* The tendency to agree with all items. 
* What to do: *balance positive and negative items.*
6
New cards
What are validity scales?
Special items in the test to detect which test takers are giving dishonest answers
7
New cards
Why do test users calculate standard scores?
Scores are *compared* to the distribution of scores for some particular reference group.

* Interpreted *relative* to the group norms – above average, below average, percentiles, etc
* *Provide more meaning to individual scores* and so that we can *compare individual scores with those of a previously tested group or norm group*.
8
New cards
What are linear transformations?
Transform raw scores so that they have a **particular mean and standard deviation.** 

* Adding, subtracting, multiplying, and dividing scores by some set of constant values. 
* The shape of the distribution stays the same.
* *Gives context and makes it easier to interpret a single score*
9
New cards
Examples of linear transformations
**z-Scores**

* Subtract the mean from the observed score. 
* Divide by the standard deviation. 
* Scores now have a mean of 0, SD of 1

**T-Scores**

* Multiply each z-score by 10 and add 50 (T = 10z + 50), so the scores have a mean of 50 and an SD of 10.
* Now all the scores are positive, but we still have **exactly** the same information we had in our z-scores!
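A minimal Python sketch of both linear transformations (the raw scores are made-up illustration data):

```python
# Minimal sketch: converting raw scores to z-scores and then to T-scores.
# The raw scores below are made-up illustration data.
raw_scores = [72, 85, 90, 64, 78]

mean = sum(raw_scores) / len(raw_scores)
sd = (sum((x - mean) ** 2 for x in raw_scores) / len(raw_scores)) ** 0.5

z_scores = [(x - mean) / sd for x in raw_scores]  # mean 0, SD 1
t_scores = [50 + 10 * z for z in z_scores]        # mean 50, SD 10, all positive

print([round(z, 2) for z in z_scores])
print([round(t, 1) for t in t_scores])
```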
10
New cards
How to interpret z-scores
Helps us to understand **how many standard deviations an individual test score is above or below the distribution mean**

* The mean of a distribution of test scores = z score of 0.
* A z score of 1 = 1 standard deviation above the mean.
* A z score of −1 = 1 standard deviation below the mean.
11
New cards
How to interpret T-scores
Similar to z scores in that they help us understand how many standard deviations an individual test score is above or below the distribution mean.

* However, they always have **a mean of 50** and a **standard deviation of 10.**
* They are also **always positive**, unlike z scores.
* T-score of 60 = 1 standard deviation above the mean.
* T-score of 30 = 2 standard deviations below the mean.
12
New cards
Why use linear transformations?
* Help us **communicate** about what a test score means in **relative** terms. 


* Help us **compare** across tests that have very different raw scores
13
New cards
What are area transformations?
Now we **are** changing the **shape** of the distribution. 

* More complicated mathematically or procedurally – doing things other than 3rd grade math.
* Examples: percentiles, stanines.
14
New cards
What are percentiles?
* % of people who scored below the person we are interested in
* All the people who scored below + half of the people who scored exactly the same.
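A minimal Python sketch of that calculation (the score list is made-up illustration data):

```python
# Minimal sketch: percentile rank = % scoring below, plus half of those
# who scored exactly the same. The scores are made-up illustration data.
scores = [10, 12, 12, 15, 15, 15, 18, 20, 22, 25]

def percentile_rank(all_scores, score):
    below = sum(1 for s in all_scores if s < score)
    equal = sum(1 for s in all_scores if s == score)
    return 100 * (below + 0.5 * equal) / len(all_scores)

print(percentile_rank(scores, 15))  # 3 below + half of 3 equal -> 45.0
```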
15
New cards
What are stanines?
A standard score scale with nine points that allows us to describe a distribution in words instead of numbers (from 1 = very poor to 9 = very superior)

* Simplified way to express percentile information
* Differentiates more between people in the middle than T-scores
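A minimal Python sketch mapping a percentile rank onto a stanine; the 4–7–12–17–20–17–12–7–4 percent bands are the conventional cutoffs and should be treated as an assumption of this sketch:

```python
# Minimal sketch: mapping a percentile rank onto a stanine using the
# conventional 4-7-12-17-20-17-12-7-4 percent bands.
def stanine(percentile_rank):
    upper_bounds = [4, 11, 23, 40, 60, 77, 89, 96]  # cumulative cutoffs for stanines 1-8
    for value, cutoff in enumerate(upper_bounds, start=1):
        if percentile_rank <= cutoff:
            return value
    return 9  # top 4%

print(stanine(50))  # middle of the distribution -> stanine 5
print(stanine(97))  # top of the distribution   -> stanine 9
```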
16
New cards
Normative approach to scoring
The scores will be used to *compare test takers with other test takers*

* E.g., an employment test in which the applicant who achieves the highest score will receive the job offer
17
New cards
Criterion approach to scoring
The scores will be used to *indicate achievement*

* Must achieve a certain score to qualify as passing or excellent
* E.g., student performance using the letter grades A to F.
18
New cards
What are norms?
The average scores of some predefined group we want to compare to. 

*  Given to a clearly defined norm group. 
* E.g., Colorado third-graders
19
New cards
Where do norms come from?
* Who does your norm group need to represent? 
* All Colorado 3rd -graders? 
* Denver district 3rd -graders? 
* All 3rd -graders nationally?
20
New cards
Why is it so important to pay attention to the norm group when interpreting a test score?
Norms apply ***only*** to this group!

* If you send the test off for scoring and get back percentiles, stanines, T-scores, etc., those scores are probably *based on the norms* (they had to get a mean and standard deviation from somewhere!) 
* ***Interpreting those scores then requires understanding of the norm group.*** 
21
New cards
Where would you find information about the norms for a published test?
In the test manual
22
New cards
What does it mean to have a representative sample, and why is it important for norming?
* **Representativeness** of the norm group is more or less important depending on who and what you are using the test for.  
* High-stakes testing, diverse populations, etc. call for higher levels of representativeness.
* Representative norm groups require careful attention to sampling (ex: *ALL* Colorado 3rd graders?)
23
New cards
What is measurement error?
ALL measurement (even physical measurement) contains some error.

* We can’t *eliminate* all error in psychological testing, but we can **reduce it** and/or **account for it** when we use tests. 


* In order to do this, though, we need to know **what kind** of error we’re dealing with and **how much**.
24
New cards
Where does measurement error come from?
* __Test construction__
* item choice, item wording, etc.
* __Test administration__
* temperature, time, lighting, administration errors, etc.
* __Test-taker variables__
* test anxiety, amount of sleep, hunger, distraction, etc.
* __Scoring and interpretation__
* differences among scorers – training, motivation, attention, etc.
25
New cards
What is the meaning of the classical test theory equation “X = T + E”?
Test score (X) = true score (T) + error (E)
26
New cards
If we know X, what do we need in order to find T and E?
A __**reliability coefficient**__ = (estimated) **proportion of test score variance that is due to true score variance.** 
27
New cards
What is a reliability coefficient, in mathematical terms?
(estimated) **proportion of test score variance that is due to true score variance.** 

* the **ratio of true variance to total variance**.
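A minimal numeric sketch of that ratio, using made-up variance components:

```python
# Minimal sketch of X = T + E: reliability as the ratio of true-score
# variance to total observed-score variance. Variances are made-up values.
var_true = 80.0    # variance of true scores (T)
var_error = 20.0   # variance of measurement error (E)

var_observed = var_true + var_error      # T and E assumed uncorrelated
reliability = var_true / var_observed    # proportion of variance due to true score

print(reliability)  # 0.8 -> 80% of score variance reflects true differences
```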
28
New cards
What are the four main approaches to testing reliability?
* Test-retest 
* Alternate forms 
* Internal consistency 
* Scorer reliability
29
New cards
What kind of error is considered in alternate forms reliability?
Create two **alternate forms** of the same test. 

* Would scores be different if we had **used a different version of the test**? (__Test construction__)
30
New cards
What are parallel forms?
* Tests are *parallel* when they have equal means, variances, and reliability. 
* Scores on different forms are interchangeable. 
* Strict requirement – ***not all alternate forms are parallel forms.*** 
31
New cards
What kind of error is considered in internal consistency reliability?
* *Goal:* separate true score from error caused by idiosyncrasies in the questions. (__Test construction__)
32
New cards
What does internal consistency reliability assume about your construct?
**Assumes** all of the items are measuring *one homogeneous construct.*

* High internal consistency is **not evidence that** all of your items measure the same thing – that would be circular!
* For heterogeneous (aka multidimensional!) tests – **need to estimate reliability separately for each component**.
33
New cards
You find that the internal consistency reliability of your new test, as measured by Cronbach’s Alpha, is only 0.60. Which of the following is the most likely explanation for this?
Your items are not homogeneous – they measure more than one thing.
34
New cards
What is split-half reliability?
* Divide your test into two halves! 
* Odd vs. even items 
* Randomly 
* *Matching items to create two mini alternate forms.* 
* **Score each half separately** and **correlate the two scores**. 
* Tells you how well the two sets of items go together. 
* But… this is only the reliability for half the test! 
35
New cards
What is the Spearman-Brown prophecy formula?
* **Estimates what the reliability of your test would be if you had more items.** 
* This is reasonable in *split-half reliability* because all of the items came from the same original test. 
* Formula: (*n* × reliability) / (1 + (*n* − 1) × reliability)

*n* = number of items in the new version / items in the original.
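A minimal Python sketch of split-half reliability plus the Spearman-Brown correction (the 0/1 item responses are made up for illustration; the halves are formed from odd vs. even items):

```python
# Minimal sketch: split-half reliability with a Spearman-Brown correction.
# Item responses are made-up 0/1 data (rows = people, columns = items).
responses = [
    [1, 0, 1, 1, 0, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = (sum((a - mx) ** 2 for a in x) / n) ** 0.5
    sy = (sum((b - my) ** 2 for b in y) / n) ** 0.5
    return cov / (sx * sy)

odd_half = [sum(person[0::2]) for person in responses]   # items 1, 3, 5
even_half = [sum(person[1::2]) for person in responses]  # items 2, 4, 6

half_r = pearson_r(odd_half, even_half)   # reliability of half the test
n_factor = 2                              # the full test is twice as long
full_r = (n_factor * half_r) / (1 + (n_factor - 1) * half_r)  # Spearman-Brown

print(round(half_r, 2), round(full_r, 2))
```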
36
New cards
Why would you use the Spearman-Brown prophecy formula?
To estimate the number of questions to add to a test so as to increase its reliability to the desired level.
37
New cards
What kind of error is considered in scorer reliability?
Goal: separate true score from error caused by differences in raters.

* Would scores be different if they came from a **different rater**? (__Scoring and interpretation__ )
38
New cards
What kind of error is considered in test-retest reliability?
* *Goal: separate true score from error caused by temporary factors.* 
* Mood, time of day, distractions, etc. (__Test administration__)


* Would scores be different if we had **measured at a different point in time?**
39
New cards
What does test-retest reliability assume about your construct?
That the true score is **stable**. 

* This is not always a safe assumption!
40
New cards
Test-retest reliability: ***coefficient of stability***
* Give the same test to the same group of people at 2 different points in time. 
* **Correlation** between Time 1 & Time 2 scores = *proportion of the variance due to true score.*
* What’s left over = proportion of the variance due to fluctuations over time. 
41
New cards
What are the KR-20 and coefficient alpha formulas for?
They are indicators of ***internal consistency***!

Estimate the average of **all possible** split-half correlations. 

* Based on all of the covariances among items. 
* So your reliability coefficient is not influenced by how you split the halves!
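A minimal Python sketch of coefficient alpha; with 0/1 items and population item variances, the same computation gives KR-20 (the item responses are made up for illustration):

```python
# Minimal sketch of coefficient alpha. With dichotomous 0/1 items and
# population item variances (p*q), this is KR-20. Data are made up.
responses = [
    [1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 0],
]

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

k = len(responses[0])                                      # number of items
item_vars = [variance([p[i] for p in responses]) for i in range(k)]
total_var = variance([sum(p) for p in responses])          # variance of total scores

alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(round(alpha, 2))
```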
42
New cards
You read a test manual that reports Cronbach’s alpha as a measure of internal consistency for a test with dichotomous items. Why is this a concern?
Internal consistency among *dichotomous* items is best measured using *KR-20.*
43
New cards
KR-20
For **dichotomous** (right vs. wrong) items
44
New cards
Cronbach’s alpha
For **rating-scale** type items 
45
New cards
Which is better, split-half reliability or coefficient alpha/KR-20?
Split-half method is a **rough** estimate.

**KR-20** and **Cronbach’s alpha** are better!
46
New cards
Interrater Reliability
* *Goal:* separate true score from error caused by differences in raters.
* Would scores be different if they came from a **different rater**? 
* This is only relevant if we **actually have** more than one rater
47
New cards
Interrater Reliability ≠ Agreement
* An interrater reliability correlation tells us that both raters put people in the **same rank order** – not that they gave the **same scores**. 
* Reliability may be enough if we’re just doing research (correlating ratings with other variables). 
* But if we are **making decisions** with these ratings, we really need **agreement**. 
48
New cards
How do you calculate interscorer/interrater reliability?
* __Percent agreement__
* How often do raters give the same score? 
* __Cohen’s kappa__ 
* How similar are the ratings of two different scorers? 
* __Intrascorer agreement__
* How consistent are **one rater’s** ratings? 
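A minimal Python sketch of the first two indices (the rating pairs are made-up illustration data):

```python
# Minimal sketch: percent agreement and Cohen's kappa for two raters.
# The rating pairs ("pass"/"fail") are made-up illustration data.
rater_a = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass"]
rater_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

n = len(rater_a)
agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # percent agreement

# Chance agreement: how often the raters would agree from base rates alone.
categories = set(rater_a) | set(rater_b)
p_chance = sum((rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories)

kappa = (agreement - p_chance) / (1 - p_chance)  # agreement corrected for chance
print(round(agreement, 2), round(kappa, 2))      # 0.75, 0.47
```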
49
New cards
What statistics can you use to calculate interscorer agreement?
* KR-20 or alpha
* Treat the scores from multiple raters just like you would treat multiple items from one test. 
50
New cards
What is intrascorer reliability?
Whether each scorer was consistent in the way he or she assigned scores from test to test.
51
New cards
How do you decide which kind of reliability to use?
Which kind(s) of error are you concerned about? 

* *Test-retest*: temporary situational factors. 
* *Alternate forms*: differences between forms. 
* *Internal consistency*: quirks of the items. 
* *Scorer*: differences between scorers or raters. 
52
New cards
How high should a reliability coefficient be to be “good enough”?
* **Depends on your purpose!** 
* __For research__: most people will accept __above__ __**.70.**__ 
* __For making decisions about people__: some people recommend __above .90__. 


* ***The higher the stakes, the higher your reliability should be.***
53
New cards
How is reliability affected by the number of items in the test?
More items = higher reliability (unless the items really don’t fit)
54
New cards
How is reliability affected by the length of time that elapses between test and retest (for test-retest reliability)?
Measurements closer together will be more closely correlated. 
55
New cards
How is reliability affected by restriction of range?
**Little variance** in a variable = restricted range = **low correlation**.
56
New cards
How is reliability affected by speed tests (rather than power tests)?
The test-taker doesn’t usually finish all items, but usually gets most if not all of them right. 

* ***Internal consistency isn’t appropriate here*** – we’re missing too much data on the later items.
57
New cards
What is the standard error of measurement (SEM)?
The standard deviation of an individual’s theoretical test score distribution. 
58
New cards
How is the standard error of measurement similar to and different from the overall reliability of a test?
Reliability tells us what % of the **variance** in test scores is attributable to error. 

* Not the same thing as the % of any one test score that is attributable to error (SEM)
59
New cards
Standard error of measurement equation
**SEM = SD × √(1 − r)**, where *r* is the test’s reliability coefficient.
60
New cards
Why would you want to know the standard error of measurement?
To better understand the amount of error in a test score
61
New cards
How would you use the SEM to calculate a confidence interval around a person’s test score?
**95% CI = X ± 1.96 × SEM**
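A minimal Python sketch combining the SEM formula above with this confidence-interval rule (the SD, reliability, and observed score are made-up illustration values):

```python
# Minimal sketch: SEM and a 95% confidence interval around one observed
# score. SD, reliability, and the observed score are made-up values.
sd = 15.0            # standard deviation of the test's score distribution
reliability = 0.89   # reliability coefficient
observed_score = 110

sem = sd * (1 - reliability) ** 0.5   # SEM = SD * sqrt(1 - r)
lower = observed_score - 1.96 * sem   # 95% CI = X +/- 1.96 * SEM
upper = observed_score + 1.96 * sem

print(round(sem, 2), (round(lower, 1), round(upper, 1)))  # ~4.97, (100.2, 119.8)
```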
62
New cards
What is the standard error of the difference?
Tells us how different two scores need to be to be considered truly different.
63
New cards
The standard error of the difference equation
**SE(diff) = SD × √(2 − reliability₁ − reliability₂)**
64
New cards
When would you use the standard error of the difference?
Want to know if the observed difference is due to **real change** or to **fluctuations in measurement error**
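A minimal Python sketch of that check (the SD, reliabilities, and the two scores are made-up illustration values):

```python
# Minimal sketch: standard error of the difference between two scores.
# SD, the two reliabilities, and the two scores are made-up values.
sd = 10.0
reliability_1 = 0.85
reliability_2 = 0.80

se_diff = sd * (2 - reliability_1 - reliability_2) ** 0.5  # SD * sqrt(2 - r1 - r2)

score_1, score_2 = 62, 55
observed_diff = abs(score_1 - score_2)

# Rough 95% criterion: is the difference larger than 1.96 * SE(diff)?
print(round(se_diff, 2), observed_diff > 1.96 * se_diff)   # 5.92, False
```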
65
New cards
Where does criterion-related validity evidence fit in the modern validity framework?
Connects to *relationships with other variables* validity
66
New cards
What is a criterion?
An important *outcome* or *result* of our construct. 

* **Not** the same as our construct – distinct, though we expect them to be related. 
67
New cards
Examples of a criterion
job performance, graduate school success, successful completion of treatment, etc.
68
New cards
How do we find evidence of criterion-related validity?
* Usually, **correlate** our test scores with the criterion (or criteria).


* Two main strategies: 
* Predictive 
* Concurrent
69
New cards
Criterion-Related Validity Coefficient
Usually: correlation between test scores and criterion!

* Denoted *r*xy – the correlation between test scores (*x*) and the criterion (*y*).
70
New cards
What is the coefficient of determination?
* Square the validity coefficient: *r*xy^2.
* The % of variance in the criterion accounted for by the test.
* E.g., *r*xy = .40 → *r*xy^2 = .16, so the test accounts for 16% of the variance.
71
New cards
Is there a minimum “acceptable” value for a validity coefficient or coefficient of determination?
* Remember that the significance of a correlation depends on the **sample size**.
* So “significance” alone isn’t a good standard for determining validity – what matters is whether you had a big enough sample to get a good estimate of the correlation. 
* **Compare** to other, similar measures – is your validity coefficient comparable? 
72
New cards
How do you determine whether your validity coefficient is big enough?
* Do you have a big enough sample to get a good estimate of the correlation?
* Is your validity coefficient comparable to other, similar measures? 
73
New cards
Does a statistically significant validity coefficient mean that your test is valid?
Statistical significance is **very dependent on sample size**, so *it is typically not a proper way of evaluating most things in psychological measurement.*

* While it can be a good start, you want to **cross-validate** that validity coefficient by correlating your test/related outcomes with different samples.
74
New cards
Concurrent Validity
* Compare test scores with the criterion at the *same time*. 
* Describes the *present* – does not tell you about the future.
75
New cards
Predictive Validity
* Administer the test now, wait, then correlate test scores now with a criterion measured at some point in the future. 


* Shows that the test does predict *future behavior*
76
New cards
Pros of Concurrent Validity
* Much faster! 
* In selection, less risky in the short term.
77
New cards
Cons of Concurrent Validity
* Relationship may change over time.
* Doesn’t really tell you about the future
78
New cards
Pros of Predictive Validity
* Better information about predicting future outcomes. 
79
New cards
Cons of Predictive Validity
* Need to be patient! 
* May lose test-takers along the way
80
New cards
What is restriction of range?
Correlation between two variables is **weakened** when we don’t have much variability in one or both variables. 

* In other words, when our participants don’t cover the full possible range of the variable.
81
New cards
What effect does restriction of range have on your validity coefficient?
We will ***underestimate*** *the validity of our test!* 
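A minimal simulation sketch of this effect (assumes numpy is available; the true correlation of .60 and the “keep only the top half” selection rule are made-up illustration choices):

```python
# Minimal sketch (assumes numpy is available): restricting the range of the
# test scores weakens the observed test-criterion correlation.
import numpy as np

rng = np.random.default_rng(0)
cov = [[1.0, 0.6], [0.6, 1.0]]                    # true correlation of .60
test, criterion = rng.multivariate_normal([0.0, 0.0], cov, size=5000).T

full_r = np.corrcoef(test, criterion)[0, 1]

selected = test > np.median(test)                 # e.g., only the applicants who were hired
restricted_r = np.corrcoef(test[selected], criterion[selected])[0, 1]

print(round(full_r, 2), round(restricted_r, 2))   # the restricted r is noticeably smaller
```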
82
New cards
What is cross-validation?
* One validity study does not guarantee that your validity coefficient will generalize! So…
* **Test again with a different sample (or the other half of your original sample) – how similar is the validity coefficient?**
83
New cards
Why would you do a meta-analysis of validity studies?
Any one study may contain error or situation-specific factors – we can be more confident in the result of a meta-analysis.
84
New cards
When would you want to use more than one predictor?
For complex outcomes with several contributing factors. 
85
New cards
What do we call it when we use multiple tests together to predict an outcome?
Test batteries
86
New cards
When you use multiple predictors, why is the combined validity coefficient always less than the sum of the individual validity coefficients?
**Incremental validity**

* The predictors share variance with one another, and that overlapping variance does NOT add more explained variance in the criterion.
* As you add more tests, the *shared* variance goes up (not the unique variance each test contributes).
87
New cards
What is incremental validity?
The amount of additional variance in the criterion measure that can be accounted for by adding one or more tests to an existing test battery.
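A minimal Python sketch of this idea (assumes numpy is available; the two overlapping predictors and the criterion are simulated for illustration): the battery’s R² exceeds either predictor’s r² alone, but falls well short of their sum because of the shared variance.

```python
# Minimal sketch (assumes numpy is available): two overlapping predictors of
# a simulated criterion. The combined R^2 is bigger than either predictor's
# r^2 alone, but much smaller than their sum, because of shared variance.
import numpy as np

rng = np.random.default_rng(1)
n = 2000
ability = rng.normal(size=n)                          # shared "true" factor
test_1 = ability + rng.normal(size=n)                 # predictor 1
test_2 = ability + rng.normal(size=n)                 # predictor 2 (overlaps with 1)
criterion = ability + rng.normal(size=n)              # outcome we want to predict

def r_squared(predictors, y):
    X = np.column_stack([np.ones(len(y))] + predictors)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return 1 - residuals.var() / y.var()

r2_one = r_squared([test_1], criterion)
r2_two = r_squared([test_2], criterion)
r2_battery = r_squared([test_1, test_2], criterion)   # both tests together

print(round(r2_one, 2), round(r2_two, 2), round(r2_battery, 2))
# r2_battery > r2_one, but well below r2_one + r2_two
```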
88
New cards
What does incremental validity tell us about a predictor?
Predictors that are **not** highly correlated with one another **do** increase our validity coefficient. 
89
New cards
Why do criteria need to be reliable and valid?
Because many of the outcomes we care about are complex (multidimensional) – the criterion measure must get at all aspects of the construct.
90
New cards
What does it mean if we say a criterion is deficient?
**Does not cover the whole outcome.** 

* *Common solution*: use multiple or composite criteria. 
91
New cards
What does it mean if we say a criterion is contaminated?
**Includes things in addition to the outcome we care about.** 

* E.g., sales performance and attractiveness. 
92
New cards
How does having a deficient or contaminated criterion affect our validity study and the conclusions we can draw?
* When the criterion is unreliable or invalid, *the observed validity coefficient may underestimate or overestimate the true validity*
* Important to think about criterion as well as predictor
93
New cards
Criterion-related validity compared to reliability
* Both rely on *correlations*. 
* **Reliability** correlates *the test with itself*. 
* *Criterion-related validity* correlates *the test with an outcome.*
* Reliability is a *necessary but not sufficient condition* for criterion-related validity. 
94
New cards
You are attempting to obtain criterion-related validity evidence for a test of high school students, but you have only one point in time to collect your data.

What is the best validation study design given these circumstances?
Concurrent Validity!

* Compare test scores with the criterion at the *same time*. (present, not future).
95
New cards
Can you have a reliable test that is not valid?
Yes
96
New cards
Can you have a test that predicts well but is not reliable?
No
97
New cards
Criterion-related validity compared to appropriate content
* If the test really measures what we think it measures, these *should* go together… but that’s not always the case. 
* We may write a well-designed, content valid test that doesn’t predict well.
* We might also write a test that predicts well but doesn’t appear to be related to our construct. 
* *You* ***can*** *have one and not the other*
98
New cards
Is it possible that a test might predict an outcome well for one group of people but not another?
Yes!

* Issues of culture, translation, etc. raise the possibility that a test might predict well for one group but not another. 
* Test users need to make sure the test is valid for ALL groups. 
99
New cards
What is single-group validity?
A test ***predicts an outcome for one group but not at all for another.*** 

* Problematic… but this is very rare in practice.
100
New cards
What is differential validity?
***More valid for one group than another***.

* Bigger issue than single-group validity