Absolute Scores as a test score transformation
Raw test scores are transformed to be easily compared to a fixed standard.
What is item bias?
Items being more difficult for one group of people than for another for reasons that have nothing to do with the construct.
E.g., males performing better on sports questions than females
What is response bias?
Individual differences in response patterns that have nothing to do with the content of the test.
E.g., using the extremes or the middle.
Introduce (additional) error.
Examples of response bias
Acquiescence = tendency to agree with all items
Random responding
Social desirability = people want to present themselves in a positive light
What is acquiescence, and what can we do about it?
The tendency to agree with all items.
What to do: balance positive and negative items.
What are validity scales?
Special items in the test to detect which test takers are giving dishonest answers
Why do test users calculate standard scores?
Scores are compared to the distribution of scores for some particular reference group.
Interpreted relative to the group norms – above average, below average, percentiles, etc
Provide more meaning to individual scores and so that we can compare individual scores with those of a previously tested group or norm group.
What are linear transformations?
Transform raw scores so that they have a particular mean and standard deviation.
Adding, subtracting, multiplying, and dividing scores by some set of constant values.
The shape of the distribution stays the same.
Gives context and makes it easier to interpret a single score
Examples of linear transformations
z-Scores
Subtract the mean from the observed score.
Divide by the standard deviation.
Scores now have a mean of 0, SD of 1
T-Scores
Multiply each z-score by 10 and add 50.
Now all the scores are positive, but we still have exactly the same information we had in our z-scores!
How to interpret z-scores
Helps us to understand how many standard deviations an individual test score is above or below the distribution mean
The mean of a distribution of test scores = z score of 0.
A z score of 1 = 1 standard deviation above the mean.
A z score of −1 = 1 standard deviation below the mean.
How to interpret T-scores
Similar to z scores in that they help us understand how many standard deviations an individual test score is above or below the distribution mean.
However, they always have a mean of 50 and a standard deviation of 10.
They are also always positive, unlike z scores.
T-score of 60 = 1 standard deviation above the mean.
T-score of 30 = 2 standard deviations below the mean.
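A minimal sketch of both linear transformations, assuming the mean and standard deviation come from the same set of scores being transformed (in practice they would usually come from a norm group). The function names and the raw scores are made up for illustration, and the population SD is an assumption; a sample SD could be used instead.

```python
from statistics import mean, pstdev


def z_scores(raw_scores):
    """Convert raw scores to z-scores (mean 0, SD 1)."""
    m = mean(raw_scores)
    sd = pstdev(raw_scores)  # population SD (assumption for this sketch)
    return [(x - m) / sd for x in raw_scores]


def t_scores(raw_scores):
    """Convert raw scores to T-scores (mean 50, SD 10)."""
    return [50 + 10 * z for z in z_scores(raw_scores)]


raw = [10, 20, 30, 40, 50]  # made-up raw scores
z = z_scores(raw)           # the middle score (30) gets z = 0
t = t_scores(raw)           # the middle score (30) gets T = 50
```

Both results keep the same rank order and the same distribution shape as the raw scores; only the scale changes.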
Why use linear transformations?
Help us communicate about what a test score means in relative terms.
Help us compare across tests that have very different raw scores
What are area transformations?
Now we are changing the shape of the distribution.
More complicated mathematically or procedurally – doing things other than 3rd grade math.
Percentiles, stanines
What are percentiles?
% of people who scored below the person we are interested in
All the people who scored below + half of the people who scored exactly the same.
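The "below plus half of the ties" rule can be sketched as a small function. The function name and the score list are hypothetical, chosen only to illustrate the rule.

```python
def percentile_rank(score, all_scores):
    """Percent scoring below `score`, plus half of those with the same score."""
    below = sum(1 for s in all_scores if s < score)
    same = sum(1 for s in all_scores if s == score)
    return 100 * (below + 0.5 * same) / len(all_scores)


scores = [55, 60, 60, 70, 75, 80, 80, 80, 90, 95]  # made-up group of 10
rank_of_80 = percentile_rank(80, scores)  # 5 below + half of 3 tied -> 65.0
```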
What are stanines?
A standard score scale with nine points that allows us to describe a distribution in words instead of numbers (from 1 = very poor to 9 = very superior)
Simplified way to express percentile information
Differentiates more between people in the middle than T-scores
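One way to see the link between percentiles and stanines is a lookup sketch. It uses the conventional stanine bands (4%, 7%, 12%, 17%, 20%, 17%, 12%, 7%, 4% of the norm group), which give cumulative cutoffs of 4, 11, 23, 40, 60, 77, 89, and 96. The function name is hypothetical, and placing a score that falls exactly on a cutoff in the higher stanine is an assumption of this sketch.

```python
from bisect import bisect_right

# Cumulative percentages at the top of stanines 1 through 8.
CUTOFFS = [4, 11, 23, 40, 60, 77, 89, 96]


def stanine(percentile_rank):
    """Map a percentile rank (0-100) to a stanine (1-9)."""
    # Assumption: a rank exactly on a cutoff goes into the higher stanine.
    return bisect_right(CUTOFFS, percentile_rank) + 1


middle = stanine(50)  # 5, the middle stanine
high = stanine(99)    # 9
```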
Normative approach to scoring
The scores will be used to compare test takers with other test takers
E.g., an employment test in which the applicant who achieves the highest score will receive the job offer
Criterion approach to scoring
The scores will be used to indicate achievement
Must achieve a certain score to qualify as passing or excellent
E.g., student performance using the letter grades A to F.
What are norms?
The average scores of some predefined group we want to compare to.
Given to a clearly defined norm group.
E.g., Colorado third-graders
Where do norms come from?
Who does your norm group need to represent?
All Colorado 3rd-graders?
Denver district 3rd-graders?
All 3rd-graders nationally?
Why is it so important to pay attention to the norm group when interpreting a test score?
Norms apply only to this group!
If you send the test off for scoring and get back percentiles, stanines, T-scores, etc., those scores are probably based on the norms (they had to get a mean and standard deviation from somewhere!)
Interpreting those scores then requires understanding of the norm group.
Where would you find information about the norms for a published test?
In the test manual
What does it mean to have a representative sample, and why is it important for norming?
Representativeness of the norm group is more or less important depending on who and what you are using the test for.
High-stakes testing, diverse populations, etc. call for higher levels of representativeness.
Representative norm groups require careful attention to sampling (e.g., ALL Colorado 3rd-graders?)
What is measurement error?
ALL measurement (even physical measurement) contains some error.
We can’t eliminate all error in psychological testing, but we can reduce it and/or account for it when we use tests.
In order to do this, though, we need to know what kind of error we’re dealing with and how much.
Where does measurement error come from?
Test construction
item choice, item wording, etc.
Test administration
temperature, time, lighting, administration errors, etc.
Test-taker variables
test anxiety, amount of sleep, hunger, distraction, etc.
Scoring and interpretation
Differences among scorers – training, motivation, attention, etc.
What is the meaning of the classical test theory equation “X = T + E”?
Test score (X) = true score (T) + error (E)
If we know X, what do we need in order to find T and E?
A reliability coefficient = (estimated) proportion of test score variance that is due to true score variance.
What is a reliability coefficient, in mathematical terms?
(estimated) proportion of test score variance that is due to true score variance.
the ratio of true variance to total variance.
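Written out as a formula, the ratio looks like this. The symbols are assumptions added here (r_XX for the reliability coefficient, sigma squared for variance), and the second equality relies on the classical test theory assumption that true scores and errors are uncorrelated, so that the variances in X = T + E add.

```latex
r_{XX} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}
```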
What are the four main approaches to testing reliability?
Test-retest
Alternate forms
Internal consistency
Scorer reliability
What kind of error is considered in alternate forms reliability?
Create two alternate forms of the same test.
Would scores be different if we had used a different version of the test? (Test construction)
What are parallel forms?
Tests are parallel when they have equal means, variances, and reliability.
Scores on different forms are interchangeable.
Strict requirement – not all alternate forms are parallel forms.
What kind of error is considered in internal consistency reliability?
Goal: separate true score from error caused by idiosyncrasies in the questions. (Test construction)
What does internal consistency reliability assume about your construct?
Assumes all of the items are measuring one homogeneous construct.
High internal consistency is not evidence that all of your items measure the same thing – that would be circular!
For heterogeneous (aka multidimensional!) tests – need to estimate reliability separately for each component.
You find that the internal consistency reliability of your new test, as measured by Cronbach’s Alpha, is only 0.60. Which of the following is the most likely explanation for this?
Your items are not homogeneous - they measure more than 1 thing.
What is split-half reliability?
Divide your test into two halves!
Odd vs. even items
Randomly
Matching items to create two mini alternate forms.
Score each half separately and correlate the two scores.
Tells you how well the two sets of items go together.
But… this is only the reliability for half the test!
What is the Spearman-Brown prophecy formula?
Estimates what the reliability of your test would be if you had more items.
This is reasonable in split-half reliability because all of the items came from the same original test.
Formula:
(n × reliability) / [1 + (n − 1) × reliability]
n = number of items in the new version / items in the original.
Why would you use the Spearman-Brown prophecy formula?
To estimate the number of questions to add to a test so as to increase its reliability to the desired level.
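A sketch of both uses of the formula: stepping a split-half correlation up to full length (n = 2), and solving the same formula for n to see how much longer a test would need to be. The function names and the numbers are hypothetical; the second function is just the Spearman-Brown formula rearranged algebraically.

```python
def spearman_brown(reliability, n):
    """Predicted reliability when the test is made n times as long."""
    return (n * reliability) / (1 + (n - 1) * reliability)


def length_factor_needed(reliability, target):
    """How many times longer the test must be to reach the target reliability."""
    return (target * (1 - reliability)) / (reliability * (1 - target))


# Split-half correlation of .70 (made up) stepped up to the full test:
full_test = spearman_brown(0.70, 2)        # about 0.82

# Lengthening needed to move from .70 to .90:
n = length_factor_needed(0.70, 0.90)       # about 3.86 times as many items
```

Both results assume the added items are comparable to the existing ones.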
What kind of error is considered in scorer reliability?
Goal: separate true score from error caused by differences in raters.
Would scores be different if they came from a different rater? (Scoring and interpretation)
What kind of error is considered in test-retest reliability?
Goal: separate true score from error caused by temporary factors.
Mood, time of day, distractions, etc. (Test administration)
Would scores be different if we had measured at a different point in time?
What does test-retest reliability assume about your construct?
That the true score is stable.
This is not always a safe assumption!
Test-retest reliability: coefficient of stability
Give the same test to the same group of people at 2 different points in time.
Correlation between Time 1 & Time 2 scores = proportion of the variance due to true score.
What’s left over = proportion of the variance due to fluctuations over time.
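The coefficient of stability is an ordinary Pearson correlation between the two administrations. A minimal sketch, with a hand-written correlation function and made-up scores for six people:

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


time1 = [12, 15, 9, 20, 17, 11]   # made-up scores at Time 1
time2 = [13, 14, 10, 19, 18, 10]  # same people at Time 2
stability = pearson_r(time1, time2)  # about 0.96
```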
What are the KR-20 and coefficient alpha formulas for?
They are indicators of internal consistency!
Estimate the average of all possible split-half correlations.
Based on all of the covariances among items.
So your reliability coefficient is not influenced by how you split the halves!
You read a test manual that reports Cronbach’s alpha as a measure of internal consistency for a test with dichotomous items. Why is this a concern?
Internal consistency among dichotomous items is best measured using KR-20.
KR-20
For dichotomous (right vs. wrong) items
Cronbach’s alpha
For rating-scale type items
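A sketch of coefficient alpha using its usual computational form: k / (k - 1) times (1 minus the sum of the item variances divided by the variance of total scores), where k is the number of items. The function name and the ratings are made up, and population variances are an assumption; the ratio is unaffected as long as the same kind of variance is used throughout.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """item_scores: one row per test taker, one column per item."""
    k = len(item_scores[0])
    # Variance of each item (column) across test takers.
    item_vars = [pvariance([row[j] for row in item_scores]) for j in range(k)]
    # Variance of each test taker's total score.
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


ratings = [  # made-up 1-5 ratings: 5 test takers x 3 items
    [4, 5, 4],
    [3, 3, 2],
    [5, 5, 5],
    [2, 1, 2],
    [1, 1, 2],
]
alpha = cronbach_alpha(ratings)  # about 0.96
```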
Which is better, split-half reliability or coefficient alpha/KR-20?
Split-half method is a rough estimate.
KR-20 and Cronbach’s alpha are better!
Interrater Reliability
Goal: separate true score from error caused by differences in raters.
Would scores be different if they came from a different rater?
This is only relevant if we actually have more than one rater
Interrater Reliability ≠ Agreement
An interrater reliability correlation tells us that both raters put people in the same rank order – not that they gave the same scores.
Reliability may be enough if we’re just doing research (correlating ratings with other variables).
But if we are making decisions with these ratings, we really need agreement.
How do you calculate interscorer/interrater reliability?
Percent agreement
How often do raters give the same score?
Cohen’s kappa
How similar are the ratings of two different scorers?
Intrascorer agreement
How consistent are one rater’s ratings?
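Percent agreement and Cohen's kappa can be sketched for two raters as follows. Kappa corrects the observed agreement for the agreement expected by chance from each rater's category frequencies. The function names and ratings are hypothetical.

```python
from collections import Counter


def percent_agreement(rater1, rater2):
    """Proportion of cases where the two raters gave the same rating."""
    return sum(a == b for a, b in zip(rater1, rater2)) / len(rater1)


def cohens_kappa(rater1, rater2):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater1)
    observed = percent_agreement(rater1, rater2)
    counts1, counts2 = Counter(rater1), Counter(rater2)
    categories = set(rater1) | set(rater2)
    # Chance agreement: product of the raters' marginal proportions per category.
    expected = sum(counts1[c] * counts2[c] for c in categories) / n ** 2
    return (observed - expected) / (1 - expected)


r1 = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass"]
r2 = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "fail", "pass"]
agreement = percent_agreement(r1, r2)  # 0.8
kappa = cohens_kappa(r1, r2)           # about 0.58
```

Kappa comes out lower than percent agreement here because some of the matching ratings would be expected by chance alone.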
What statistics can you use to calculate interscorer agreement?
KR-20 or alpha
Treat multiple scores from one rater just like you would treat multiple items from one test.
What is intrascorer reliability?
Whether each scorer was consistent in the way he or she assigned scores from test to test.
How do you decide which kind of reliability to use?
Which kind(s) of error are you concerned about?
Test-retest: temporary situational factors.
Alternate forms: differences between forms.
Internal consistency: quirks of the items.
Scorer: differences between scorers or raters.
How high should a reliability coefficient be to be “good enough”?
Depends on your purpose!
For research: most people will accept above .70.
For making decisions about people: some people recommend above .90.
The higher the stakes, the higher your reliability should be.
How is reliability affected by the number of items in the test?
More items = higher reliability (unless the items really don’t fit)
How is reliability affected by the length of time that elapses between test and retest (for test-retest reliability)?
Measurements closer together will be more closely correlated.
How is reliability affected by restriction of range?
Little variance in a variable = restricted range = low correlation.
How is reliability affected by speed tests (rather than power tests)?
The test-taker doesn’t usually finish all items, but usually gets most if not all of them right.
Internal consistency isn’t appropriate here – we’re missing too much data on the later items.
What is the standard error of measurement (SEM)?
The standard deviation of an individual’s theoretical test score distribution.
How is the standard error of measurement similar to and different from the overall reliability of a test?
Reliability tells us what % of the variance in test scores is attributable to true scores; the remainder (1 − reliability) is attributable to error.
Not the same thing as the % of any one test score that is attributable to error (SEM)
Standard error of measurement equation
SEM = SD × √(1 − r), where SD is the standard deviation of the test scores and r is the reliability coefficient.
Why would you want to know the standard error of measurement?
To better understand the amount of error in a test score
How would you use the SEM to calculate a confidence interval around a person’s test score?
95% CI = X +/- 1.96(SEM)
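A short worked sketch of the SEM and the 95% confidence interval. The function names and the values (SD = 15, reliability = .91, observed score = 100) are made up for illustration.

```python
from math import sqrt


def sem(sd, reliability):
    """Standard error of measurement: SD times the square root of (1 - reliability)."""
    return sd * sqrt(1 - reliability)


def confidence_interval_95(observed, sd, reliability):
    """95% confidence interval around an observed score: X +/- 1.96 * SEM."""
    margin = 1.96 * sem(sd, reliability)
    return observed - margin, observed + margin


error = sem(15, 0.91)                                # about 4.5
low, high = confidence_interval_95(100, 15, 0.91)    # about 91.2 to 108.8
```

With a lower reliability the same observed score gets a wider interval, which is why the SEM is more useful than the reliability coefficient when interpreting one person's score.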
What is the standard error of the difference?
Tells us how different two scores need to be to be considered truly different.
The standard error of the difference equation
SE(diff) = SD × √(2 − r1 − r2), where r1 and r2 are the reliability coefficients of the two scores.
When would you use the standard error of the difference?
Want to know if the observed difference is due to real change or to fluctuations in measurement error
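A sketch of the calculation, assuming both scores are on the same scale with a common SD. The function name and the values are hypothetical, and using 1.96 as the multiplier for a 95% threshold is an assumption carried over from the confidence interval formula.

```python
from math import sqrt


def se_difference(sd, reliability1, reliability2):
    """Standard error of the difference between two scores on the same scale."""
    return sd * sqrt(2 - reliability1 - reliability2)


se_diff = se_difference(10, 0.90, 0.85)  # about 5.0
# Assumed 95% threshold: a difference smaller than this could plausibly
# be due to measurement error alone.
threshold = 1.96 * se_diff               # about 9.8
```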
Where does criterion-related validity evidence fit in the modern validity framework?
Connects to validity evidence based on relationships with other variables.
What is a criterion?
An important outcome or result of our construct.
Not the same as our construct – distinct, though we expect them to be related.
Examples of a criterion
job performance, graduate school success, successful completion of treatment, etc.
How do we find evidence of criterion-related validity?
Usually, correlate our test scores with the criterion (or criteria).
Two main strategies:
Predictive
Concurrent
Criterion-Related Validity Coefficient
Usually: correlation between test scores and criterion!
r_xy
What is the coefficient of determination?
Square the correlation coefficient: (r_xy)^2
% of variance accounted for.
Is there a minimum “acceptable” value for a validity coefficient or coefficient of determination?
Remember that the significance of a correlation depends on the sample size.
So “significance” alone isn’t a good standard for determining validity. What matters more is whether the sample was big enough to give a good estimate of the correlation.
Compare to other, similar measures – is your validity coefficient comparable?
How do you determine whether your validity coefficient is big enough?
Do you have a big enough sample to get a good estimate of the correlation?
Is your validity coefficient comparable to other, similar measures?
Does a statistically significant validity coefficient mean that your test is valid?
Statistical significance is very dependent on sample size, so it is typically not a proper way of evaluating most things in psychological measurement.
While it can be a good start, you want to cross-validate that validity coefficient by correlating test scores with the criterion in different samples.
Concurrent Validity
Compare test scores with the criterion at the same time.
Describes the present – does not tell you about the future.
Predictive Validity
Administer the test now, wait, then correlate test scores now with a criterion measured at some point in the future.
Shows that the test does predict future behavior
Pros of Concurrent Validity
Much faster!
In selection, less risky in the short term.
Cons of Concurrent Validity
Relationship may change over time.
Doesn’t really tell you about the future
Pros of Predictive Validity
Better information about predicting future outcomes.
Cons of Predictive Validity
Need to be patient!
May lose test-takers along the way
What is restriction of range?
Correlation between two variables is weakened when we don’t have much variability in one or both variables.
In other words, when our participants don’t cover the full possible range of the variable.
What effect does restriction of range have on your validity coefficient?
We will underestimate the validity of our test!
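A small deterministic illustration of the effect, using made-up data: 20 "applicants" with test scores 1 to 20 and a criterion that tracks the test score with some alternating noise. Correlating only the top five scorers (as if only they had been hired) gives a much weaker coefficient than the full group. All names and numbers are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


test = list(range(1, 21))                                    # test scores 1..20
criterion = [t + (3 if t % 2 == 1 else -3) for t in test]    # score plus/minus 3

full_r = pearson_r(test, criterion)                          # about 0.88

# Keep only the people with test scores of 16 or higher (restricted range).
selected = [(t, c) for t, c in zip(test, criterion) if t >= 16]
restricted_r = pearson_r([t for t, _ in selected],
                         [c for _, c in selected])           # about 0.43
```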
What is cross-validation?
One validity study does not guarantee your validity coefficient! So…
Test again with a different sample (or the other half of your original sample) – how similar is the validity coefficient?
Why would you do a meta-analysis of validity studies?
Any one study may contain error or situation-specific factors – we can be more confident in the result of a meta-analysis.
When would you want to use more than one predictor?
For complex outcomes with several contributing factors.
What do we call it when we use multiple tests together to predict an outcome?
Test batteries
When you use multiple predictors, why is the combined validity coefficient always less than the sum of the individual validity coefficients?
INCREMENTAL VALIDITY
There is a level of shared variance (it does NOT add more variance)
As you add more tests, the shared variance goes up (not the individual variance)
What is incremental validity?
The amount of additional variance in the criterion measure that can be accounted for by adding one or more tests to the test battery.
What does incremental validity tell us about a predictor?
Predictors that are not highly correlated with one another do increase our validity coefficient.
Why do criteria need to be reliable and valid?
Because many of the outcomes we care about are complex (multidimensional). Must get at all aspects of the construct
What does it mean if we say a criterion is deficient?
Does not cover the whole outcome.
Common solution: use multiple or composite criteria.
What does it mean if we say a criterion is contaminated?
Includes things in addition to the outcome we care about.
E.g., sales performance and attractiveness.
How does having a deficient or contaminated criterion affect our validity study and the conclusions we can draw?
When the criterion is unreliable or invalid, the true validity coefficient might be under- or overestimated.
Important to think about criterion as well as predictor
Criterion-related validity compared to reliability
Both rely on correlations.
Reliability correlates the test with itself.
Criterion-related validity correlates the test with an outcome.
Reliability is a necessary but not sufficient condition for criterion-related validity.
You are attempting to obtain criterion-related validity evidence for a test given to high school students, and you have only one point in time to collect your data.
What is the best validation study design given these circumstances?
Concurrent Validity!
Compare test scores with the criterion at the same time. (present, not future).
Can you have a reliable test that is not valid?
Yes
Can you have a test that predicts well but is not reliable?
No
Criterion-related validity compared to appropriate content
If the test really measures what we think it measures, these should go together… but that’s not always the case.
We may write a well-designed, content valid test that doesn’t predict well.
We might also write a test that predicts well but doesn’t appear to be related to our construct.
You can have one and not the other
Is it possible that a test might predict an outcome well for one group of people but not another?
Yes!
Issues of culture, translation, etc. raise the possibility that a test might predict well for one group but not another.
Test users need to make sure the test is valid for ALL groups.
What is single-group validity?
A test predicts an outcome for one group but not at all for another.
Problematic… but this is very rare in practice.
What is differential validity?
More valid for one group than another.
Bigger issue than single-group validity