How does test reliability influence the interpretation of an individual’s test score?
Test reliability tells us how much error is in an individual’s score. We use it to calculate the Standard Error of Measurement (SEM), which estimates the amount of measurement error.
Then, we use SEM to create a confidence interval around the observed score, giving a range where the true score likely falls.
What is Standard Error of Measurement?
SEM is the standard deviation of the errors in measurement
SEM gives us an indication of the amount of variability in an individual’s test score that is due to measurement error (in the test)
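A minimal numeric sketch of how SEM and the confidence interval are computed (the SD, reliability, and score below are made up, not from the deck):

```python
import math

# Hypothetical values (not from the deck)
sd_x = 15.0      # standard deviation of observed test scores
r_xx = 0.90      # test reliability
observed = 110   # an individual's observed score

# SEM = SD * sqrt(1 - reliability)
sem = sd_x * math.sqrt(1 - r_xx)   # about 4.74

# 95% confidence interval around the observed score
lower = observed - 1.96 * sem      # about 100.7
upper = observed + 1.96 * sem      # about 119.3
print(f"SEM = {sem:.2f}, 95% CI = [{lower:.1f}, {upper:.1f}]")
```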
Why is test reliability important in real-world decision making?
Test reliability directly affects how we interpret an individual’s observed score.
Low reliability = more measurement error → less confidence in scores
High reliability = more precise and trustworthy scores
Because of this, tests used to make decisions about people (e.g., hiring, admissions) should have reliability ≥ .90 to ensure accuracy and fairness.
True or false: a measure can't be valid if it is not reliable
True.
Conceptually we need reliability to show validity
How does reliability influence correlations and validity?
Reliability limits correlations and validity.
Low reliability (high error) → weaker correlations
A measure cannot correlate well with others if it is unreliable
Therefore, validity is limited by reliability
Conceptually:
A test cannot be valid if it is not reliable
When do we use the maximum observed correlation formula?
We use this formula to determine the highest possible correlation between two measures, given their reliabilities.
It shows the upper limit of the observed correlation
Helps explain why correlations may be lower than expected
Used to assess whether a low correlation is due to low reliability
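A small sketch of the usual maximum-correlation formula, r_max = sqrt(r_xx * r_yy); the reliabilities below are hypothetical:

```python
import math

# Hypothetical reliabilities of the two measures (not from the deck)
r_xx, r_yy = 0.70, 0.80

# Upper limit of the observed correlation between X and Y
r_max = math.sqrt(r_xx * r_yy)   # sqrt(0.56) is about 0.75
print(f"Maximum observed correlation ≈ {r_max:.2f}")
```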
What is the correction for attenuation formula and when is it used?

The correction for attenuation formula estimates what the correlation would be if both variables were perfectly reliable (no measurement error). It reveals the true/latent relationship between the variables.
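A minimal sketch of the standard correction-for-attenuation formula, r_corrected = r_xy / sqrt(r_xx * r_yy), with hypothetical numbers:

```python
import math

# Hypothetical values (not from the deck)
r_xy = 0.40              # observed correlation
r_xx, r_yy = 0.70, 0.80  # reliabilities of the two measures

# Estimated correlation if both measures were perfectly reliable
r_corrected = r_xy / math.sqrt(r_xx * r_yy)   # 0.40 / 0.75 is about 0.53
print(f"Disattenuated correlation ≈ {r_corrected:.2f}")
```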
How does reliability influence results of multiple regression?
Reliability affects regression because it limits correlations between variables.
Low reliability → weaker correlations
Weaker correlations → predictors may appear non-significant
Results can be misleading due to measurement error
Why might a predictor appear non-significant in regression?
Because of low reliability.
High measurement error reduces correlations
This can make a real relationship appear insignificant
Why do we report reliability (rₓₓ) in correlation/regression tables?
To show the quality of the measures.
Helps interpret results correctly
Low reliability may explain weak or null findings
What is the risk of low reliability in regression analysis?
It can lead to incorrect conclusions (e.g., thinking a predictor doesn’t matter when it actually does)
How does reliability influence composite scores and how does reliability increase?
Composite scores are usually more reliable than individual tests.
Reliability increases when:
There are more tests
Tests are reliable
Tests are highly correlated
What is the reliability of a composite score? (Formula)
Number of tests (k)
Average reliability of tests
Average correlation among tests
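One standard way to combine these three ingredients is the formula for the reliability of a sum of k standardized tests (the deck's exact formula may be written differently but uses the same inputs); a sketch with made-up numbers:

```python
# Hypothetical values (not from the deck)
k = 4            # number of tests in the composite
mean_rel = 0.75  # average reliability of the tests
mean_r = 0.50    # average correlation among the tests

# Reliability of a composite of k standardized tests:
#   r_composite = 1 - k*(1 - mean_rel) / (k + k*(k-1)*mean_r)
r_composite = 1 - k * (1 - mean_rel) / (k + k * (k - 1) * mean_r)
print(f"Composite reliability ≈ {r_composite:.2f}")   # 0.90, higher than the average single-test reliability of .75
```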

Why do correlated tests increase composite reliability?
Because they measure the same underlying construct, making the combined score more consistent.
What happens if tests in a composite are not correlated?
Composite reliability decreases because the tests are measuring different things.
Can highly reliable tests still produce a weaker composite?
Yes, if the tests are not correlated, the composite reliability is lower.
What are the two common situations for reliability of observed behaviour?
Judge’s ratings (continuous data)
Coding behaviour (nominal data)
A teacher rates a student’s participation on a scale from 1–5 (judge’s rating), while another observer records whether the student spoke in class (yes/no) (coding behaviour).
How do we assess reliability of coded behaviour?
Use Cohen’s Kappa, which measures agreement beyond chance.
What is the key difference between alpha and kappa?
Alpha → continuous ratings (agreement in degree)
Kappa → categorical coding (agreement beyond chance)
What does reliability of observed behaviour measure?
The consistency/agreement between raters or coders
What is the average inter-item correlation in judge ratings?
The average correlation between all raters.
Ex: Three judges rate the same people. If Judge A and B correlate at .80, A and C at .70, and B and C at .75, the average inter-item correlation is about .75 — meaning the judges generally agree in their ratings.
What formula is used to calculate reliability of judge ratings?
Cronbach’s alpha
Treat raters as items and use Cronbach’s alpha to assess consistency
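Using the three-judge example above (average inter-rater correlation ≈ .75), the standardized form of alpha (equivalent to applying the Spearman-Brown formula to the average correlation; the deck's own calculation may use the covariance-based version) works out as:

```python
# From the example above: 3 judges, average inter-rater correlation of .75
k = 3
mean_r = 0.75

# Standardized Cronbach's alpha, treating raters as items
alpha = (k * mean_r) / (1 + (k - 1) * mean_r)   # 2.25 / 2.5 = 0.90
print(f"alpha ≈ {alpha:.2f}")
```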

What does α = .81 mean in judge ratings?
There is good agreement between raters, meaning ratings are reliable.
How do we assess reliability when coding behaviour (nominal data) and why can’t we just use percentage agreement?
Use Cohen’s Kappa, which measures agreement beyond chance.
We can’t rely on percent agreement because some agreement occurs by chance, which overestimates reliability.
What does Cohen’s Kappa measure?
The proportion of agreement between raters after removing chance agreement.
Example: Two teachers rate whether students are “on-task” or “off-task.” They agree 80% of the time, but 50% agreement could happen by chance—so Cohen’s Kappa measures the agreement beyond that chance portion.
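Plugging that example into the standard Kappa formula, kappa = (P - Pc) / (1 - Pc):

```python
# From the example above: observed agreement .80, chance agreement .50
P, Pc = 0.80, 0.50

kappa = (P - Pc) / (1 - Pc)   # 0.30 / 0.50 = 0.60
print(f"kappa = {kappa:.2f}")
```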
What do the diagonal cells in a coding table represent?
Instances where both raters agreed on the same category.

What are the two components needed to calculate Kappa?
Observed agreement (P)
Chance agreement (Pc)

How do we interpret Kappa values?

How do you calculate chance agreement (Pc) in Cohen’s Kappa?

Pc is calculated from the marginal totals: for each category, multiply the row proportion by the column proportion, then sum those products across categories.
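A small sketch of that calculation on a hypothetical 2 × 2 coding table (the counts are made up, not from the deck):

```python
# Rows = Rater A (on-task, off-task); columns = Rater B (on-task, off-task)
table = [[40, 10],
         [5, 45]]

n = sum(sum(row) for row in table)                                   # 100
row_p = [sum(row) / n for row in table]                              # [.50, .50]
col_p = [sum(table[i][j] for i in range(2)) / n for j in range(2)]   # [.45, .55]

# Observed agreement: the diagonal cells
P = (table[0][0] + table[1][1]) / n                                  # .85

# Chance agreement: sum of (row proportion * column proportion) per category
Pc = sum(row_p[i] * col_p[i] for i in range(2))                      # .50*.45 + .50*.55 = .50

kappa = (P - Pc) / (1 - Pc)                                          # (.85 - .50) / .50 = .70
print(f"P = {P:.2f}, Pc = {Pc:.2f}, kappa = {kappa:.2f}")
```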
What does chance agreement (Pc) represent?
The proportion of agreement expected by chance alone.
Why do we calculate Pc in Kappa?
To remove agreement due to chance and get a more accurate measure of reliability.
Why can high percentage agreement be misleading?
Because some agreement occurs by chance.
If chance agreement is high, the true reliability may actually be low.
How can coding data help improve reliability?
By identifying where raters disagree, allowing researchers to refine categories and coding rules.
What is item difficulty?
Proportion of respondents with correct response
Most relevant for intelligence tests

What does p = .20 and p = .80 mean?
20% of respondents answered correctly → the item is difficult.
80% of respondents answered correctly → the item is easy.
What is the range of item difficulty (p), and is item difficulty (p) the same as a p-value?
0 to 1
No, item difficulty is a proportion, not a significance test.
What is the optimal item difficulty?
Approximately p = .50 (range ≈ .30–.70), where item discrimination and variance are maximized.
Why is p = .50 optimal?
Because it maximizes variability and allows the item to best distinguish between high and low ability individuals.
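One way to see this: a right/wrong item is scored 0 or 1, so its variance is p(1 - p), which peaks at p = .50.

```python
# Variance of a dichotomous (0/1) item is p * (1 - p); it is largest at p = .50
for p in [0.10, 0.30, 0.50, 0.70, 0.90]:
    print(f"p = {p:.2f}  ->  item variance = {p * (1 - p):.2f}")
# Very easy or very hard items leave little variance to discriminate with
```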
What happens if an item is too easy or too difficult?
It reduces variability and discrimination because most people get the same answer.
How does item difficulty affect discrimination?
Moderate difficulty (p ≈ .50) leads to maximum discrimination across ability levels.
What are the two qualifiers of optimal item difficulty?
Guessing – inflates item difficulty, so p must be adjusted
Purpose of the test – optimal difficulty depends on whether you want to detect low or high ability
How does guessing affect item difficulty?
When guessing is possible, item difficulty (p) is artificially inflated.
Some correct answers come from chance
Therefore, p must be adjusted upward based on number of answer choices
👉 More options = less guessing effect
👉 Fewer options = more guessing effect
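One commonly taught adjustment (the deck's exact rule may differ) puts the optimal difficulty halfway between the chance rate and 1.0:

```python
# Rule of thumb: optimal p with guessing ≈ (1 + chance) / 2, where chance = 1 / number of options
for n_options in [2, 4, 5]:
    chance = 1 / n_options
    optimal_p = (1 + chance) / 2
    print(f"{n_options} options: chance = {chance:.2f}, optimal p ≈ {optimal_p:.3f}")
# 2 options -> .750, 4 options -> .625, 5 options -> .600
```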
How do answer choices affect guessing and item difficulty?
More answer choices → lower probability of guessing correctly
Fewer choices → higher probability of guessing correctly
👉 This increases the observed item difficulty (p)
How does the purpose of a test affect item difficulty?
To detect low ability → use easy items (high p ≈ .80)
To detect high ability → use difficult items (low p ≈ .20)
What is distractor analysis in multiple-choice questions?
It evaluates whether the incorrect answer choices (distractors) are plausible enough to attract the students who get the item wrong.
How do you evaluate whether distractors are effective?
Compare actual selections to expected (even distribution).
Good distractor → chosen by some students
Bad distractor → rarely chosen
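A small sketch of that comparison (the counts and option labels are hypothetical):

```python
# 40 students missed the item; with 3 distractors, an even split would be ~13 each
wrong_choices = {"B": 22, "C": 15, "D": 3}
expected = sum(wrong_choices.values()) / len(wrong_choices)   # about 13.3

for option, count in wrong_choices.items():
    verdict = "rarely chosen -> weak distractor" if count < expected / 2 else "plausible distractor"
    print(f"Option {option}: {count} picks (expected ≈ {expected:.1f}) - {verdict}")
```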
When is item variability maximized?
When the mean endorsement is near the midpoint of the scale, and responses are well spread (high SD).
Simplified:
When answers are spread out and not all the same (around the middle of the scale).
Example:
On a 1–7 scale, if people give answers from 2 to 6 (not all 1s or all 7s), variability is high.
What are mean endorsement and item variability?
Mean endorsement = average response to an item (level of agreement)
Item variability = how spread out responses are (measured by SD)
When is item variability maximized?
When mean item endorsement is around the midpoint of scale
What are floor and ceiling effects?
Floor effect → responses cluster at the low end
Example: Mean = 1.6, SD = .60
Ceiling effect → responses cluster at the high end
Example: Mean = 6.0, SD = .40
👉 Both indicate low variability and poor discrimination
Why is a midpoint mean not enough to ensure good item variability?
Because responses may still be clustered around the midpoint; you must also check the standard deviation (SD).
Why are low-variability items problematic?
They do not differentiate between individuals and contribute little to total test variance.
What is the item discrimination index (D)?
It measures how well an item distinguishes between high and low scorers.
High D → good item (high scorers correct, low scorers incorrect)
Low or negative D → poor item
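A minimal sketch of the usual calculation, D = (proportion correct in the high-scoring group) - (proportion correct in the low-scoring group); the numbers below are made up:

```python
# Hypothetical values (not from the deck)
p_upper = 0.85   # proportion correct in the high-scoring group
p_lower = 0.35   # proportion correct in the low-scoring group

D = p_upper - p_lower   # 0.50 -> good discrimination
print(f"D = {D:.2f}")
```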

How do you interpret values of the item discrimination index (D)?
+1 → perfect discrimination (ideal item)
0 → no discrimination
-1 → negative discrimination (bad item; low scorers outperform high scorers)
How do item difficulty (p) and discrimination (D) relate?
Item difficulty limits discrimination
Maximum discrimination occurs when p ≈ .50
But p ≈ .50 does not guarantee good discrimination
What are Ebel’s guidelines for interpreting item discrimination (D)?
D ≥ .40 → optimal
.30–.39 → acceptable
.20–.29 → marginal
≤ .19 → poor
👉 D should always be positive
What is item-total/rest correlation (Rᵢₜ)?
The correlation between an item and the rest of the test.
👉 Indicates how well the item fits the overall scale
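A minimal sketch of how an item-rest correlation can be computed (the scores below are invented):

```python
import numpy as np

# Hypothetical 0/1 scores for 6 people on a 4-item test (not from the deck)
items = np.array([
    [1, 1, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
])

# Item-rest correlation for item 0: correlate it with the sum of the remaining items
item = items[:, 0]
rest = items[:, 1:].sum(axis=1)
r_it = np.corrcoef(item, rest)[0, 1]
print(f"Item-rest correlation ≈ {r_it:.2f}")
```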
How do you interpret item-total correlation values?
High → good item (fits test)
Low or near 0 → poor item (does not measure the same construct)
How do inter-item correlations help evaluate test items?
They show how strongly items relate to each other.
👉 Items with low correlations with others likely do not measure the same construct and may need revision or removal.
What is content saturation in item analysis?
It refers to how well an item measures the intended construct.
👉 Good items correlate highly with the relevant scale and weakly with irrelevant scales.
How do you evaluate social desirability effects at the item level?
By correlating each item with:
the relevant scale (should be high)
a social desirability scale (should be low)
What does an item characteristic curve (ICC) show?
It shows the relationship between a person’s level of an attribute (ability) and their probability of answering an item correctly.
👉 As ability increases, probability of correct response increases
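A minimal sketch using one common parametric form of an ICC (the two-parameter logistic model; the deck may describe the curve only conceptually, and the parameter values here are made up):

```python
import math

def icc(theta, a=1.0, b=0.0):
    """P(correct) given ability theta, discrimination a, and difficulty b."""
    return 1 / (1 + math.exp(-a * (theta - b)))

for theta in [-2, -1, 0, 1, 2]:
    print(f"ability = {theta:+d}  ->  P(correct) = {icc(theta):.2f}")
# Probability of a correct response rises from ~.12 at low ability to ~.88 at high ability
```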
What is validity?
The extent to which a test measures the construct it is intended to measure.
How is validity different from reliability?
Reliability = consistency
Validity = accuracy (measuring the correct construct)
What are the main types of validity evidence?
Face validity
Content validity
Criterion-related validity
Construct validity
What is face validity?
The extent to which a test appears to measure what it is intended to measure.
What is the difference between high and low face validity?
High → items clearly reflect the construct
Low → items are indirect or unclear
Why is face validity useful but limited?
It improves acceptability and motivation but does not guarantee true validity.
What is content validity?
The extent to which a test adequately represents all aspects of the construct being measured.
How is content validity assessed?
Define the construct
Experts match items to the construct
Compare to see if items adequately cover it
What does low content validity indicate?
That the test does not fully represent the construct (important aspects are missing).
What is criterion-related validity?
The extent to which a test predicts or correlates with a real-world outcome (criterion).
What is the difference between concurrent and predictive validity?
Concurrent → measured at the same time
Predictive → measured at a later time
How is criterion-related validity assessed?
By calculating the correlation between test scores and criterion performance.
What does r² represent in criterion validity?
The proportion of variance in the outcome explained by the test.
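For example (the validity coefficient is hypothetical):

```python
r = 0.30            # hypothetical test-criterion correlation
r_squared = r ** 2  # 0.09 -> the test explains about 9% of the variance in the criterion
print(f"r² = {r_squared:.2f}")
```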
What determines whether criterion validity is predictive or concurrent?
The timing of when the criterion is measured (later vs same time), not the statistical method used.
Does using regression mean a study is predictive validity?
No. Regression can be used in both predictive and concurrent validity. Timing determines the type.
Are criterion validity coefficients typically very high?
No. They are often modest (e.g., .20–.30), especially in real-world settings.
What is restricted range?
A reduction in score variability because only a limited portion of the full range is observed.
How does restricted range affect correlations?
It lowers correlations, making relationships appear weaker than they actually are.
Why does restricted range occur in predictive validity studies?
Because only selected individuals (e.g., top applicants) are followed up, reducing variability.
What is construct validity?
The extent to which a test truly measures the theoretical construct it is intended to measure.
What is convergent validity?
Evidence that a test correlates highly with other measures of the same or related constructs.
What is discriminant validity?
Evidence that a test does not correlate strongly with unrelated constructs.
What is a nomological network?
A set of expected relationships between a construct and other variables used to evaluate construct validity.
How do researchers evaluate construct validity using a nomological network?
By examining whether the measure correlates with related constructs (convergent validity) and does not correlate with unrelated constructs (discriminant validity).
What magnitude of correlations is expected for convergent vs discriminant validity?
Convergent validity → high correlations
Discriminant validity → low (or non-significant) correlations
Why is relative difference more important than absolute magnitude in construct validity?
Because correlation sizes depend on the study; what matters is that correlations with related constructs are higher than with unrelated constructs.
What is incremental validity?
“The extra predictive value a measure adds beyond existing predictors.”
Example (to remember it):
SAT predicts GPA → adding a personality test improves prediction → that improvement = incremental validity.
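A minimal sketch of how incremental validity is often checked (hierarchical regression, comparing R² with and without the new measure); the data and variable names below are simulated, not from the deck:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
sat = rng.normal(size=n)
personality = rng.normal(size=n)
gpa = 0.5 * sat + 0.3 * personality + rng.normal(scale=0.8, size=n)

# Step 1: existing predictor only
X1 = sat.reshape(-1, 1)
r2_base = LinearRegression().fit(X1, gpa).score(X1, gpa)

# Step 2: add the new measure
X2 = np.column_stack([sat, personality])
r2_full = LinearRegression().fit(X2, gpa).score(X2, gpa)

print(f"ΔR² = {r2_full - r2_base:.3f}")   # the increment added by the new measure
```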
Why is incremental validity important in research?
It ensures new measures contribute additional knowledge and helps prevent unnecessary construct proliferation.
What is the difference between criterion-related validity and test utility?
Criterion-related validity → how well a test predicts a criterion
Test utility → how useful the test is for making real-world decisions
What does test utility refer to?
The extent to which a test improves the accuracy of decisions (e.g., hiring or selection).
What is the base rate (BR)?
The proportion of people in the population who would succeed on the criterion.

What is the selection ratio (SR)?
The proportion of applicants who are selected (number selected ÷ number who apply).

Taylor and Russell (1939)
Tables that give the expected proportion of successful selections when a test is used, as a function of:
1. Test validity
2. SR
3. BR on the criterion
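A small worked example of SR and BR (the numbers are hypothetical):

```python
# Hypothetical selection scenario (not from the deck)
applicants = 200
selected = 40
would_succeed = 120   # applicants who would succeed on the criterion

SR = selected / applicants        # selection ratio = 0.20
BR = would_succeed / applicants   # base rate = 0.60
print(f"SR = {SR:.2f}, BR = {BR:.2f}")
# With the test's validity plus SR and BR, the Taylor-Russell tables give the
# expected proportion of selected people who succeed
```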

What is a true positive vs false positive?
True positive → accept someone who succeeds
False positive → accept someone who fails
Why do we consider the four decision outcomes?
To evaluate the usefulness (utility) of a test by examining all possible decision consequences: true positives, false positives, false negatives (rejecting someone who would have succeeded), and true negatives (rejecting someone who would have failed).

Why was Hull wrong about test usefulness?
Because test usefulness depends on context; even tests with low validity can be useful, and tests with high validity may not be useful.

What is factor analysis?
A method that groups related variables into a smaller number of underlying factors.
Example:
Questions about stress, worry, and nervousness all group into one factor: anxiety.
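A minimal sketch of that idea using simulated data and scikit-learn's FactorAnalysis (the item structure and loadings are invented for illustration):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Simulated responses: 300 people, 6 items; items 0-2 reflect one latent factor
# ("anxiety"), items 3-5 reflect another
rng = np.random.default_rng(0)
anxiety = rng.normal(size=(300, 1))
other = rng.normal(size=(300, 1))
items = np.hstack([
    anxiety + 0.5 * rng.normal(size=(300, 3)),
    other + 0.5 * rng.normal(size=(300, 3)),
])

fa = FactorAnalysis(n_components=2).fit(items)
print(np.round(fa.components_, 2))   # loadings: each row is one factor
```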
What is a latent factor?
An unobserved (hidden) variable that explains patterns of correlations among observed variables.