Psychological Measurement & Testing - Exam 2

Description and Tags

Psychology

University/Undergrad

New cards

Absolute Scores as a test score transformation

Raw test scores are transformed to be easily *compared to a fixed standard.*

New cards

What is item bias?

Items being more difficult for one group of people than another for **reasons that have nothing to do with the construct.**

* E.g., males performing better on sports questions than females

New cards

What is response bias?

* Individual differences in response patterns that have nothing to do with the content of the test.
* E.g., using the extremes or the middle.
* Introduce (additional) error.

New cards

Examples of response bias

* __Acquiescence__ = tendency to agree with all items
* __Random responding__
* __Social desirability__ = people want to present themselves in a positive light

New cards

What is acquiescence, and what can we do about it?

* The tendency to agree with all items.
* What to do: *balance positive and negative items.*

New cards

What are validity scales?

Special items in the test to detect which test takers are giving dishonest answers

New cards

Why do test users calculate standard scores?

Scores are *compared* to the distribution of scores for some particular reference group.

* Interpreted *relative* to the group norms – above average, below average, percentiles, etc
* *Provide more meaning to individual scores* so that we can *compare individual scores with those of a previously tested group or norm group*.

New cards

What are linear transformations?

Transform raw scores so that they have a **particular mean and standard deviation.**

* Adding, subtracting, multiplying, and dividing scores by some set of constant values.
* The shape of the distribution stays the same.
* *Gives context and makes it easier to interpret a single score*

New cards

Examples of linear transformations

**z-Scores**

* Subtract the mean from the observed score.
* Divide by the standard deviation.
* Scores now have a mean of 0, SD of 1

**T-Scores**

* Multiply each z-score by 10 and add 50.
* Now all the scores are positive, but we still have **exactly** the same information we had in our z-scores!

New cards

How to interpret z-scores

Helps us to understand **how many standard deviations an individual test score is above or below the distribution mean**

* The mean of a distribution of test scores = z score of 0.
* A z score of 1 = 1 standard deviation above the mean.
* A z score of −1 = 1 standard deviation below the mean.

New cards

How to interpret T-scores

Similar to z scores in that they help us understand how many standard deviations an individual test score is above or below the distribution mean.

* However, they always have **a mean of 50** and a **standard deviation of 10.**
* They are also **always positive**, unlike z scores.
* T-score of 60 = 1 standard deviation above the mean.
* T-score of 30 = 2 standard deviations below the mean.
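
A minimal sketch of both linear transformations, assuming a small set of made-up raw scores and the population standard deviation (a real norm group might use the sample SD instead). The function names are illustrative only.

```python
from statistics import mean, pstdev

def z_scores(raw):
    """Linear transformation to a mean of 0 and an SD of 1."""
    m = mean(raw)
    sd = pstdev(raw)  # assumption: population SD of this group
    return [(x - m) / sd for x in raw]

def t_scores(raw):
    """Rescale z-scores to a mean of 50 and an SD of 10."""
    return [50 + 10 * z for z in z_scores(raw)]

raw = [10, 20, 30, 40, 50]  # hypothetical raw scores
zs = z_scores(raw)
ts = t_scores(raw)
for x, z, t in zip(raw, zs, ts):
    print(f"raw={x:>3}  z={z:+.2f}  T={t:.1f}")
```

The rank order and the shape of the distribution are unchanged; only the mean and SD are different.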

New cards

Why use linear transformations?

* Help us **communicate** about what a test score means in **relative** terms.

* Help us **compare** across tests that have very different raw scores

New cards

What are area transformations?

Now we **are** changing the **shape** of the distribution.

* More complicated mathematically or procedurally – doing things other than 3rd grade math.
* Percentiles, Stanines

New cards

What are percentiles?

* % of people who scored below the person we are interested in
* All the people who scored below + half of the people who scored exactly the same.
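
The definition above can be sketched as a short function. The scores are made up, and `percentile_rank` is an illustrative name, not a standard library call.

```python
def percentile_rank(scores, x):
    """Percent scoring below x, plus half of those scoring exactly x."""
    below = sum(1 for s in scores if s < x)
    same = sum(1 for s in scores if s == x)
    return 100 * (below + 0.5 * same) / len(scores)

scores = [55, 60, 60, 70, 75, 80, 80, 80, 90, 95]  # hypothetical group
pr = percentile_rank(scores, 80)
print(pr)  # 5 below + half of the 3 tied = 6.5 of 10 people -> 65.0
```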

New cards

What are stanines?

A standard score scale with nine points that allows us to describe a distribution in words instead of numbers (from 1 = very poor to 9 = very superior)

* Simplified way to express percentile information
* Differentiates more between people in the middle than T-scores

New cards

Normative approach to scoring

The scores will be used to *compare test takers with other test takers*

* E.g., an employment test in which the applicant who achieves the highest score will receive the job offer

New cards

Criterion approach to scoring

The scores will be used to *indicate achievement*

* Must achieve a certain score to qualify as passing or excellent
* E.g., student performance using the letter grades A to F.

New cards

What are norms?

The average scores of some predefined group we want to compare to.

* Given to a clearly defined norm group.
* E.g., Colorado third-graders

New cards

Where do norms come from?

* Who does your norm group need to represent?
* All Colorado 3rd-graders?
* Denver district 3rd-graders?
* All 3rd-graders nationally?

New cards

Why is it so important to pay attention to the norm group when interpreting a test score?

Norms apply ***only*** to this group!

* If you send the test off for scoring and get back percentiles, stanines, t-scores, etc., those scores are probably *based on the norms* (they had to get a mean and standard deviation from somewhere!)
* ***Interpreting those scores then requires understanding of the norm group.***

New cards

Where would you find information about the norms for a published test?

In the test manual

New cards

What does it mean to have a representative sample, and why is it important for norming?

* **Representativeness** of the norm group is more or less important depending on who and what you are using the test for.
* High-stakes testing, diverse populations, etc. call for higher levels of representativeness.
* Representative norm groups require careful attention to sampling (ex: *ALL* Colorado 3rd graders?)

New cards

What is measurement error?

ALL measurement (even physical measurement) contains some error.

* We can’t *eliminate* all error in psychological testing, but we can **reduce it** and/or **account for it** when we use tests.

* In order to do this, though, we need to know **what kind** of error we’re dealing with and **how much**.

New cards

Where does measurement error come from?

* __Test **construction**__
  * item choice, item wording, etc.
* __Test **administration**__
  * temperature, time, lighting, administration errors, etc.
* __**Test-taker** variables__
  * test anxiety, amount of sleep, hunger, distraction, etc.
* __Scoring and interpretation__
  * Differences among scorers – training, motivation, attention, etc.

New cards

What is the meaning of the classical test theory equation “X = T + E”?

Test score (X) = true score (T) + error (E)

New cards

If we know X, what do we need in order to find T and E?

A __**reliability coefficient**__ = (estimated) **proportion of test score variance that is due to true score variance.**

New cards

What is a reliability coefficient, in mathematical terms?

(estimated) **proportion of test score variance that is due to true score variance.**

* the **ratio of true variance to total variance**.

New cards

What are the four main approaches to testing reliability?

* Test-retest
* Alternate forms
* Internal consistency
* Scorer reliability

New cards

What kind of error is considered in alternate forms reliability?

Create two **alternate forms** of the same test.

* Would scores be different if we had **used a different version of the test**? (__Test **construction**__)

New cards

What are parallel forms?

* Tests are *parallel* when they have equal means, variances, and reliability.
* Scores on different forms are interchangeable.
* Strict requirement – ***not all alternate forms are parallel forms.***

New cards

What kind of error is considered in internal consistency reliability?

* *Goal:* separate true score from error caused by idiosyncrasies in the questions. (__Test__ __**construction**__)

New cards

What does internal consistency reliability assume about your construct?

**Assumes** all of the items are measuring *one homogeneous construct.*

* High internal consistency is **not evidence that** all of your items measure the same thing – that would be circular!
* For heterogeneous (aka multidimensional!) tests – **need to estimate reliability separately for each component**.

New cards

You find that the internal consistency reliability of your new test, as measured by Cronbach’s Alpha, is only 0.60. Which of the following is the most likely explanation for this?

Your items are not homogeneous - they measure more than 1 thing.

New cards

What is split-half reliability?

* Divide your test into two halves!
  * Odd vs. even items
  * Randomly
  * *Matching items to create two mini alternate forms.*
* **Score each half separately** and **correlate the two scores**.
* Tells you how well the two sets of items go together.
* But… this is only the reliability for half the test!
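
A rough sketch of an odd/even split, assuming made-up right/wrong item data (one row per test taker). `pearson_r` and `split_half` are illustrative helpers written for this example.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def split_half(item_scores):
    """Score odd and even items separately, then correlate the two halves."""
    odd = [sum(person[0::2]) for person in item_scores]   # items 1, 3, 5, ...
    even = [sum(person[1::2]) for person in item_scores]  # items 2, 4, 6, ...
    return pearson_r(odd, even)

# Hypothetical data: 5 test takers x 6 items (1 = right, 0 = wrong)
data = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]
r_half = split_half(data)
print(round(r_half, 3))
```

The result describes a test only half as long as the real one, which is why a length correction is normally applied afterwards.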

New cards

What is the Spearman-Brown prophecy formula?

* **Estimates what the reliability of your test would be if you had more items.**
* This is reasonable in *split-half reliability* because all of the items came from the same original test.
* Formula:
* **(*n* × reliability) / (1 + (*n* − 1) × reliability)**

*n* = number of items in the new version / items in the original.
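
A small sketch of the formula, plus the same formula solved for *n*. The reliabilities used here are made up, and both function names are illustrative.

```python
def spearman_brown(reliability, n):
    """Predicted reliability when the test is made n times as long."""
    return (n * reliability) / (1 + (n - 1) * reliability)

def length_factor_needed(reliability, target):
    """Spearman-Brown solved for n: how many times longer the test must be."""
    return (target * (1 - reliability)) / (reliability * (1 - target))

# Split-half case: a half-test correlation of .60, doubled in length (n = 2)
full_test = spearman_brown(0.60, 2)
print(round(full_test, 2))  # 1.2 / 1.6 = 0.75

# How much longer must a test with reliability .75 be to reach .90?
n_needed = length_factor_needed(0.75, 0.90)
print(round(n_needed, 2))  # 0.225 / 0.075 = 3.0, i.e. three times as many items
```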

New cards

Why would you use the Spearman-Brown prophecy formula?

To estimate the number of questions to add to a test so as to increase its reliability to the desired level.

New cards

What kind of error is considered in scorer reliability?

Goal: separate true score from error caused by differences in raters.

* Would scores be different if they came from a **different rater**? (__Scoring and interpretation__ )

New cards

What kind of error is considered in test-retest reliability?

* *Goal: separate true score from error caused by temporary factors.*
* Mood, time of day, distractions, etc. (__Test **administration**__)

* Would scores be different if we had **measured at a different point in time?**

New cards

What does test-retest reliability assume about your construct?

That the true score is **stable**.

* This is not always a safe assumption!

New cards

Test-retest reliability: ***coefficient of stability***

* Give the same test to the same group of people at 2 different points in time.
* **Correlation** between Time 1 & Time 2 scores = *proportion of the variance due to true score.*
* What’s left over = proportion of the variance due to fluctuations over time.

New cards

What are the KR-20 and coefficient alpha formulas for?

They are ***indicators of internal consistency!***

Estimate the average of **all possible** split-half correlations.

* Based on all of the covariances among items.
* So your reliability coefficient is not influenced by how you split the halves!
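
A minimal sketch of coefficient alpha using the usual textbook formula, k/(k − 1) × (1 − sum of item variances / variance of total scores). The formula is not given on these cards, so it is an addition here. The ratings are made up and population variances are assumed throughout.

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """item_scores: one list of item responses per test taker."""
    k = len(item_scores[0])                        # number of items
    items = list(zip(*item_scores))                # regroup responses by item
    item_var = sum(pvariance(col) for col in items)
    total_var = pvariance([sum(person) for person in item_scores])
    return (k / (k - 1)) * (1 - item_var / total_var)

# Hypothetical data: 4 test takers x 3 rating-scale items (1 to 5)
ratings = [
    [4, 5, 4],
    [2, 3, 2],
    [3, 3, 4],
    [5, 4, 5],
]
alpha = cronbach_alpha(ratings)
print(round(alpha, 3))  # 1.5 * (1 - 3.125 / 7.5) = 0.875
```

With 0/1 items, each item variance is p × q, so the same computation gives the KR-20 value.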

New cards

You read a test manual that reports Cronbach’s alpha as a measure of internal consistency for a test with dichotomous items. Why is this a concern?

Internal consistency among *dichotomous* items is best measured using *KR-20.*

New cards

KR-20

For **dichotomous** (right vs. wrong) items

New cards

Cronbach’s alpha

For **rating-scale** type items

New cards

Which is better, split-half reliability or coefficient alpha/KR-20?

Split-half method is a **rough** estimate.

**KR-20** and **Cronbach’s alpha** are better!

New cards

Interrater Reliability

* *Goal:* separate true score from error caused by differences in raters.
* Would scores be different if they came from a **different rater**?
* This is only relevant if we **actually have** more than one rater

New cards

Interrater Reliability ≠ Agreement

* An interrater reliability correlation tells us that both raters put people in the **same rank order** – not that they gave the **same scores**.
* Reliability may be enough if we’re just doing research (correlating ratings with other variables).
* But if we are **making decisions** with these ratings, we really need **agreement**.

New cards

How do you calculate interscorer/interrater reliability?

* __Percent agreement__
  * How often do raters give the same score?
* __Cohen’s kappa__
  * How similar are the ratings of two different scorers?
* __*Intra*scorer agreement__
  * How consistent are **one rater’s** ratings?
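
A sketch of the first two statistics for two raters, assuming made-up pass/fail ratings. Kappa is computed with the standard definition (observed agreement corrected for the agreement expected by chance), which is not spelled out on these cards.

```python
from collections import Counter

def percent_agreement(r1, r2):
    """Proportion of cases where both raters gave the same rating."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(r1)
    p_o = percent_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    # Chance agreement: product of each rater's marginal proportions, summed
    p_e = sum((c1[cat] / n) * (c2[cat] / n) for cat in set(r1) | set(r2))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings of the same 10 people by two raters
rater_a = ["pass", "pass", "fail", "pass", "fail",
           "fail", "pass", "pass", "fail", "pass"]
rater_b = ["pass", "pass", "fail", "fail", "fail",
           "pass", "pass", "pass", "fail", "pass"]

agreement = percent_agreement(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)
print(agreement)        # 8 of 10 ratings match -> 0.8
print(round(kappa, 3))  # (0.80 - 0.52) / (1 - 0.52) = 0.583
```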

New cards

What statistics can you use to calculate interscorer agreement?

* KR-20 or alpha
* Treat multiple scores from one rater just like you would treat multiple items from one test.

New cards

What is intrascorer reliability?

Whether each scorer was consistent in the way he or she assigned scores from test to test.

New cards

How do you decide which kind of reliability to use?

Which kind(s) of error are you concerned about?

* *Test-retest*: temporary situational factors.
* *Alternate forms*: differences between forms.
* *Internal consistency*: quirks of the items.
* *Scorer*: differences between scorers or raters.

New cards

How high should a reliability coefficient be to be “good enough”

* **Depends on your purpose!**
* __For research__: most people will accept __above__ __**.70.**__
* __For making decisions about people__: some people recommend __above .90__.

* ***The higher the stakes, the higher your reliability should be.***

New cards

How is reliability affected by the number of items in the test?

More items = higher reliability (unless the items really don’t fit)

New cards

How is reliability affected by the length of time that elapses between test and retest (for test-retest reliability)?

Measurements closer together will be more closely correlated.

New cards

How is reliability affected by restriction of range?

**Little variance** in a variable = restricted range = **low correlation**.

New cards

How is reliability affected by speed tests (rather than power tests)?

The test-taker doesn’t usually finish all items, but usually gets most, if not all, of the attempted items right.

* ***Internal consistency isn’t appropriate here*** – we’re missing too much data on the later items.

New cards

What is the standard error of measurement (SEM)?

The standard deviation of an individual’s theoretical test score distribution.

New cards

How is the standard error of measurement similar to and different from the overall reliability of a test?

Reliability tells us what % of the **variance** in test scores is attributable to true scores; what’s left over (1 − reliability) is the % attributable to error.

* Not the same thing as the % of any one test score that is attributable to error (SEM)

New cards

Standard error of measurement equation

**SEM = SD × √(1 − r)**, where r is the test’s reliability coefficient

New cards

Why would you want to know the standard error of measurement?

To better understand the amount of error in a test score

New cards

How would you use the SEM to calculate a confidence interval around a person’s test score?

**95% CI = X ± 1.96 × SEM**
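
A worked sketch of the SEM and the 95% confidence interval, assuming a hypothetical test with SD = 15 and reliability = .91. The function names are illustrative.

```python
from math import sqrt

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * sqrt(1 - reliability)

def confidence_interval_95(score, sd, reliability):
    """95% CI around an observed score: X +/- 1.96 * SEM."""
    margin = 1.96 * sem(sd, reliability)
    return score - margin, score + margin

# Hypothetical values: SD = 15, reliability = .91, observed score X = 100
s = sem(15, 0.91)
low, high = confidence_interval_95(100, 15, 0.91)
print(round(s, 2))                     # 15 * sqrt(0.09) = 4.5
print(round(low, 2), round(high, 2))   # 100 -/+ 8.82 -> 91.18 108.82
```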

New cards

What is the standard error of the difference?

Tells us how different two scores need to be to be considered truly different.

New cards

The standard error of the difference equation

**SE(diff) = SD × √(2 − reliability₁ − reliability₂)**
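
A small sketch of the equation, assuming two hypothetical tests on the same scale (SD = 15) with reliabilities of .91 and .84. Multiplying by 1.96 to get a 95% threshold is an assumption borrowed from the confidence-interval card, not something stated on this one.

```python
from math import sqrt

def se_difference(sd, rel1, rel2):
    """Standard error of the difference between two scores."""
    return sd * sqrt(2 - rel1 - rel2)

se_diff = se_difference(15, 0.91, 0.84)
threshold = 1.96 * se_diff  # assumed 95% cutoff for a "real" difference
print(round(se_diff, 2))    # 15 * sqrt(0.25) = 7.5
print(round(threshold, 2))  # 1.96 * 7.5 = 14.7
```

Under these assumptions, two scores would need to differ by about 15 points before the difference is treated as more than measurement error.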

New cards

When would you use the standard error of the difference?

Want to know if the observed difference is due to **real change** or to **fluctuations in measurement error**

New cards

Where does criterion-related validity evidence fit in the modern validity framework?

Connects to *relationships with other variables* validity

New cards

What is a criterion?

An important *outcome* or *result* of our construct.

* **Not** the same as our construct – distinct, though we expect them to be related.

New cards

Examples of a criterion

job performance, graduate school success, successful completion of treatment, etc.

New cards

How do we find evidence of criterion-related validity?

* Usually, **correlate** our test scores with the criterion (or criteria).

* Two main strategies:
* Predictive
* Concurrent

New cards

Criterion-Related Validity Coefficient

Usually: correlation between test scores and criterion!

* rxy

New cards

What is the coefficient of determination?

* Square the correlation coefficient: (rxy)²
* % of variance accounted for.
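
A one-step worked example, assuming a hypothetical validity coefficient of .40:

```python
def coefficient_of_determination(r_xy):
    """Proportion of criterion variance accounted for by the test."""
    return r_xy ** 2

r_xy = 0.40  # hypothetical validity coefficient
pct = round(100 * coefficient_of_determination(r_xy))
print(f"r = {r_xy:.2f} accounts for about {pct}% of the variance")  # 16%
```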

New cards

Is there a minimum “acceptable” value for a validity coefficient or coefficient of determination?

* Remember that the significance of a correlation depends on the **sample size**.
* So “significance” alone isn’t a good standard for determining validity; what matters more is whether you had a big enough sample to get a good estimate of the correlation.
* **Compare** to other, similar measures – is your validity coefficient comparable?

New cards

How do you determine whether your validity coefficient is big enough?

* Do you have a big enough sample to get a good estimate of the correlation?
* Is your validity coefficient comparable to other, similar measures?

New cards

Does a statistically significant validity coefficient mean that your test is valid?

Statistical significance is **very dependent on sample size**, so *it is typically not a proper way of evaluating most things in psychological measurement.*

* While it can be a good start, you want to **cross-validate** that validity coefficient by correlating test scores with the criterion in different samples.

New cards

Concurrent Validity

* Compare test scores with the criterion at the *same time*.
* Describes the *present* – does not tell you about the future.

New cards

Predictive Validity

* Administer the test now, wait, then correlate test scores now with a criterion measured at some point in the future.

* Shows that the test does predict *future behavior*

New cards

Pros of Concurrent Validity

* Much faster!
* In selection, less risky in the short term.

New cards

Cons of Concurrent Validity

* Relationship may change over time.
* Doesn’t really tell you about the future

New cards

Pros of Predictive Validity

* Better information about predicting future outcomes.

New cards

Cons of Predictive Validity

* Need to be patient!
* May lose test-takers along the way

New cards

What is restriction of range?

Correlation between two variables is **weakened** when we don’t have much variability in one or both variables.

* In other words, when our participants don’t cover the full possible range of the variable.

New cards

What effect does restriction of range have on your validity coefficient?

We will ***underestimate*** *the validity of our test!*
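
A rough simulation of the effect, assuming simulated test and criterion scores with a true correlation of about .60 and a selection rule that keeps only the top 20% of test scorers. All of the numbers are made up for illustration.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

random.seed(42)
n = 5000
true_r = 0.6  # assumed population validity coefficient

# Simulate standardized test scores and a criterion correlated with them
test = [random.gauss(0, 1) for _ in range(n)]
criterion = [true_r * t + sqrt(1 - true_r ** 2) * random.gauss(0, 1)
             for t in test]
full_r = pearson_r(test, criterion)

# Restrict the range: keep only the top 20% of test scorers (e.g., those hired)
cutoff = sorted(test)[int(0.8 * n)]
selected = [(t, c) for t, c in zip(test, criterion) if t >= cutoff]
restricted_r = pearson_r([t for t, _ in selected], [c for _, c in selected])

print(f"full range r = {full_r:.2f}, restricted range r = {restricted_r:.2f}")
```

The correlation in the selected group comes out noticeably lower than in the full group, even though the test works equally well for everyone.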

New cards

What is cross-validation?

* One validity study does not guarantee your validity coefficient! So…
* **Test again with a different sample (or the other half of your original sample) – how similar is the validity coefficient?**

New cards

Why would you do a meta-analysis of validity studies?

Any one study may contain error or situation-specific factors – we can be more confident in the result of a meta-analysis.

New cards

When would you want to use more than one predictor?

For complex outcomes with several contributing factors.

New cards

What do we call it when we use multiple tests together to predict an outcome?

Test batteries

New cards

When you use multiple predictors, why is the combined validity coefficient always less than the sum of the individual validity coefficients?

INCREMENTAL VALIDITY

* There is a level of shared variance (it does NOT add more variance)
* As you add more tests, *the shared variance goes up* (not the individual variance)

New cards

What is incremental validity?

The amount of additional variance in the criterion measure that can be accounted for by adding one or more tests to the test battery.

New cards

What does incremental validity tell us about a predictor?

Predictors that are **not** highly correlated with one another **do** increase our validity coefficient.

New cards

Why do criteria need to be reliable and valid?

Because many of the outcomes we care about are complex (multidimensional). Must get at all aspects of the construct

New cards

What does it mean if we say a criterion is deficient?

**Does not cover the whole outcome.**

* *Common solution*: use multiple or composite criteria.

New cards

What does it mean if we say a criterion is contaminated?

**Includes things in addition to the outcome we care about.**

* E.g., sales performance and attractiveness.

New cards

How does having a deficient or contaminated criterion affect our validity study and the conclusions we can draw?

* When the criterion is unreliable or invalid, *the true validity coefficient might be under- or overestimated*
* Important to think about criterion as well as predictor

New cards

Criterion-related validity compared to reliability

* Both rely on *correlations*.
* **Reliability** correlates *the test with itself*.
* *Criterion-related validity* correlates *the test with an outcome.*
* Reliability is a *necessary but not sufficient condition* for criterion-related validity.

New cards

You are attempting to obtain criterion-related validity evidence for a test for high school students, and you have only one point in time to collect your data.

What is the best validation study design given these circumstances?

Concurrent Validity!

* Compare test scores with the criterion at the *same time*. (present, not future).

New cards

Can you have a reliable test that is not valid?

Yes

New cards

Can you have a test that predicts well but is not reliable?

No. Reliability is a *necessary but not sufficient condition* for criterion-related validity.

New cards

Criterion-related validity compared to appropriate content

* If the test really measures what we think it measures, these *should* go together… but that’s not always the case.
* We may write a well-designed, content valid test that doesn’t predict well.
* We might also write a test that predicts well but doesn’t appear to be related to our construct.
* *You* ***can*** *have one and not the other*

New cards

Is it possible that a test might predict an outcome well for one group of people but not another?

Yes!

* Issues of culture, translation, etc. raise the possibility that a test might predict well for one group but not another.
* Test users need to make sure the test is valid for ALL groups.

New cards

What is single-group validity?

A test ***predicts an outcome for one group but not at all for another.***

* Problematic… but this is very rare in practice.

New cards

What is differential validity?

***More valid for one group than another***.

* Bigger issue than single-group validity