Exam 2: Psychological Assessment
Validity: is the test measuring what you think it is measuring?
Reliability: how close to the true score are we likely to be
Practicality: does it make sense to apply it in this setting? Is it worth it? Related to utility
Cross-Sectional Fairness: is it accurate for someone in this group? Do you get the same answer across other tests? Absence of test or assessor biases
Correlation: the degree of the relationship between 2 variables
Separate from causation
(relationship could reflect causation, but need experiment to know for sure)
Negative, Positive, Strong and Weak correlations
No Correlation: no relationship; no pattern, random, no meaningful predictors
Info about scatter plots:
Some graphs don’t have a line of best fit because the data has no correlation
The closer to -1 or +1, the STRONGER the relationship and the better your prediction, because it's more accurate
The closer to 0 the weaker the relationship!
Correlation does not equal causation!!
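To make this concrete, here is a tiny Python sketch (the study-hours/score data are invented for illustration, not from class) that computes r for a handful of points:

```python
import numpy as np

# Invented example data: hours studied vs. exam score
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
score = np.array([55, 60, 58, 70, 72, 80, 78, 88])

# Pearson correlation coefficient r, ranges from -1 to +1
r = np.corrcoef(hours, score)[0, 1]
print(round(r, 2))  # near +1 -> strong positive relationship -> better prediction
```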
3 types of validity:
2. Criterion-related validity: relationships between test scores and other measures. Do the scores predict performance on a criterion?
Criterion = the outcome measure of interest
Criterion prerequisites: must be VALID and uncontaminated (meaning it's independent of the test and can't share any items with it)
Two Types of Criterion Related Validity:
Concurrent Validity: can predict the criterion score now (at about the same time). Ex: 100 clients take the Beck Depression Inventory (BDI); 500 people take an alcohol-use test
Predictive Validity: can predict scores/performance in the future. Ex: an insurer using your current college GPA to determine your insurance rate, basing it on the potential risk of you crashing given how low or high your GPA is. SAT scores correlated with college GPA. GRE scores correlated with graduation rate.
Both are about predicting
Predictive-validity-related stats: expectancy tables and the standard error of estimate (check the handout)
SEest: the average distance of actual scores from the regression line
High r: strong relationship between the measure (test score) and what you are predicting. No one is far from the line; predict the score on the line and you will be close
Low r: weak relationship between the measure (test score) and what you are predicting
No r: no relationship between measures
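A rough sketch with made-up numbers, using the common shortcut SEest = SD of Y x sqrt(1 - r^2) (assuming that simple version rather than the n - 2 corrected one):

```python
import numpy as np

# Invented example: test scores (predictor) and later job-performance ratings (criterion)
x = np.array([10, 12, 15, 18, 20, 25])
y = np.array([3.0, 3.4, 3.1, 4.0, 4.2, 4.8])

r = np.corrcoef(x, y)[0, 1]
se_est = np.std(y) * np.sqrt(1 - r**2)  # average distance of actual scores from the regression line

print(round(r, 2), round(se_est, 2))  # higher r -> smaller SEest -> predictions hug the line
```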
Standards for prediction vary
Set cut off scores that optimize ‘hit rate’ for situation
Hit rate: hits / (hits + misses)
Hit: accurately predicted classification
Miss: false negatives and false positives
False positives: predict high/pass/presence of the trait, but it's actually absent
False negatives: predict the absence of something, but it's actually there
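A minimal sketch of how a cut-off score produces hits, false positives, and false negatives (the scores, trait labels, and cut-off below are all invented):

```python
# Invented data: (test score, does the person actually have the trait?)
people = [(85, True), (78, True), (72, False), (90, True), (60, False), (75, False)]
cutoff = 80  # predict "has the trait" if score >= cutoff

hits = false_pos = false_neg = 0
for score, has_trait in people:
    predicted = score >= cutoff
    if predicted == has_trait:
        hits += 1          # hit: accurately predicted classification
    elif predicted:
        false_pos += 1     # false positive: predicted the trait, but it's absent
    else:
        false_neg += 1     # false negative: predicted absence, but it's there

misses = false_pos + false_neg
print(hits / (hits + misses))  # hit rate
```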
Construct Validity
Construct: theoretical, intangible quality people vary on (ex: intelligence, leadership, psychopathy, anxiety, hostility, and self-esteem)
We infer that these qualities are real and that they exist. We try to group together predictable patterns of behavioral characteristics over related items
(construct is the assumed reason for the pattern)
Construct validity asks…is this quality measurable, and is this test an accurate measure of it?
This is broader than content or criterion validity
Convergent Validity: scores correlate as expected with other tests (positively or negatively)
Ex: with older, established tests of the construct or related measures
Discriminant Validity: scores show little or no relationship to measures that the theory predicts they should not relate to
Reliability is about consistency: do you get the same results after each test?
It implies that there's very little error and the observed score is close to the true score
Reliability Coefficient: a statistic that quantifies reliability. Ranges from 0 (not reliable) to 1 (perfectly reliable)
Classical Test Theory: Spearman, 1904. (Item Response Theory, by contrast: the probability of getting an item correct should be related to item difficulty and overall skill level)
*** Reliability = variance of T / (variance of T + error variance)
Reliability is a measure of the variability of true scores divided by the variability of the observed scores. (True + Error)
The closer the error variance is to zero, the closer the reliability coefficient is to 1
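A quick simulation of the X = T + E idea (the true-score and error variances below are made up): reliability falls out as the share of observed-score variance that is true-score variance.

```python
import numpy as np

rng = np.random.default_rng(0)

true_scores = rng.normal(100, 15, size=10_000)  # variance of T = 15**2
error = rng.normal(0, 5, size=10_000)           # error variance = 5**2
observed = true_scores + error                  # X = T + E

reliability = true_scores.var() / observed.var()  # var(T) / (var(T) + var(E))
print(round(reliability, 2))  # near 1 when error variance is near zero
```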
Random error can be unpredictable: environmental problems (temperature), examinee state (sleepy, not feeling well), administration errors, rapport issues, test-scoring errors, judgment errors
Standardization (define it)
Different Ways to Measure Reliability:
Test-Retest (the gold standard): correlate scores from the same test given at different times
Possible issues…practice effects, looking up answers between administrations
Alternate Forms:
Same content tapped differently but equally
correlate people’s scores on both versions
harder than you might guess
Interscorer (interjudge/interrater)
Same people, different scorers/administrators
*Use correlation coefficient
Split Half (odd-even):
try to divide the test into 2 equally difficult halves, then correlate the scores
useful if cost or practice effects would impact test-retest (especially if they impact some people more than others)
tends to be lower b/c each half is shorter; can correct for that (Spearman-Brown; see the sketch at the end of this section), but there are still issues with how to pick your halves…led to:
Coefficient Alpha:
mean of all possible split halves
Inter-Item Consistency:
degree of correlation among all items
.9 or .95 is the goal for most tests
as low as .7 can be accepted
research sometimes accepts even lower values
should find these stats in the test manual
Are the items homogeneous or heterogeneous in nature?
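The length correction mentioned under split-half is usually the Spearman-Brown formula; a minimal sketch, assuming that's the correction meant:

```python
def spearman_brown(half_test_r: float) -> float:
    """Estimate full-length test reliability from a split-half correlation."""
    return (2 * half_test_r) / (1 + half_test_r)

print(round(spearman_brown(0.70), 2))  # ~0.82: corrects for each half being shorter than the full test
```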
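And a sketch of coefficient (Cronbach's) alpha from invented item scores, using the standard k/(k-1) x (1 - sum of item variances / total-score variance) form:

```python
import numpy as np

# Invented data: 5 people x 4 related items (e.g., 1-5 ratings)
items = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [1, 2, 1, 2],
])

k = items.shape[1]
item_vars = items.var(axis=0, ddof=1)      # variance of each item
total_var = items.sum(axis=1).var(ddof=1)  # variance of people's total scores

alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(round(alpha, 2))  # .9-.95 is the goal for most tests; >= .7 sometimes accepted
```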