Reliability
consistency of results regardless of how many times the assessment is given or who takes it
Synonym for dependability or consistency
Refers to something that produces similar results
proportion of total variance attributed to true variance
Precedes validity; without this a test cannot be valid
Rasch Model
each item on the test is assumed to have an equivalent relationship with the construct being measured by the test
checks whether a test question matches a person’s ability level.
ex: Student Ability
Anna = very skilled in math
Ben = average
Carla = beginner
Question Difficulty
Question 1 = easy
Question 10 = difficult
According to the Rasch Model:
Anna will likely answer both easy and difficult questions correctly
Carla may only answer easy questions correctly
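A minimal Python sketch of the Rasch model's item response function, assuming hypothetical ability and difficulty values for Anna, Ben, and Carla on the same logit scale:
```python
import math

def rasch_p_correct(ability, difficulty):
    """Rasch model: P(correct) = e^(ability - difficulty) / (1 + e^(ability - difficulty))."""
    return 1 / (1 + math.exp(-(ability - difficulty)))

# Hypothetical abilities and item difficulties (logits)
students = {"Anna": 2.0, "Ben": 0.0, "Carla": -2.0}
items = {"Question 1 (easy)": -1.5, "Question 10 (difficult)": 1.5}

for name, theta in students.items():
    for item, b in items.items():
        print(f"{name} on {item}: P = {rasch_p_correct(theta, b):.2f}")
```
Running it shows Anna with a high probability of success on both items, while Carla's probability is high only on the easy item, matching the example above.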
Observer differences
a source of error variance arising when different observers record the same behavior
even though they have the same instructions, different judges observing the same event may record different numbers
assessed using kappa statistic
Time sampling
same test given at different points in time may produce different scores, even if given to the same test takers
assessed using test-retest method
transient error
source of error attributable to variations in the testtaker’s feelings, moods, or mental state over time
Methodological Error
when mistakes come from the way the test or research is designed and conducted.
examples: interviewers may not have been trained properly
wording in the questionnaire may have been ambiguous
items may have been biased to favor one group over another
error variance
variance from irrelevant, random sources
caused by mistakes, random factors, or outside influences — not the person’s real ability or trait.
The student is actually good at math, but:
the room is noisy
they did not sleep well
they felt nervous
true variance
variance from true differences of the test takers on the construct being measured
reflects a person’s real ability, trait, or characteristic — not mistakes or random factors.
ex: Student A is actually better in math than Student B. Their different scores reflect real ability.
Reliability Coefficient
statistic that quantifies reliability, ranging from 0 to 1
reliable
The greater the proportion of the total variance attributed to true variance, the more reliable the test
Perfect Reliability
1.0
may indicate redundancy / homogeneity
Excellent Reliability
≥ 0.9
minimum for clinical settings
Charles Spearman
pioneered reliability assessment.
Worked out most of the basics of contemporary reliability theory and published his work in a 1904 article entitled “The Proof and Measurement of Association between Two Things.”
Abraham De Moivre
introduced the basic notion of sampling error
Karl Pearson
developed the product moment correlation
Good Reliability
0.8 ≤ r < 0.9
Acceptable Reliability
0.7 ≤ r < 0.8
minimum for psychometric tests
Questionable Reliability
0.6 ≤ r < 0.7
acceptable for research
Poor Reliability
0.5 ≤ r < 0.6
Unacceptable Reliability
< 0.5
No Reliability
0.0
Goals of Reliability
Estimate errors (anything unaccounted for) in psychological measurement
Devise techniques to improve testing so errors are reduced
reliability
r = σ²_true / σ²_total (the ratio of true score variance to total observed variance)
True Score
measurement of a quantity if there were no measurement error at all
Can never be observed directly
Its approximation can be identified by averaging repeated measurements
Tied to the measurement instrument used
Long-term average of many measurements free of carryover effects
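A short Python simulation of the idea that averaging many error-laden measurements approximates the true score. It assumes a hypothetical true score of 100 and random error that averages to zero:
```python
import random

random.seed(0)
TRUE_SCORE = 100  # hypothetical true score, never directly observable

# Each observed score = true score + random error (here, error ~ uniform[-5, 5])
observations = [TRUE_SCORE + random.uniform(-5, 5) for _ in range(1000)]

estimate = sum(observations) / len(observations)
print(f"Average of 1000 measurements: {estimate:.2f}")  # approaches 100
```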
true score
person’s real score or actual level on a trait or ability, free from measurement error.
without any mistakes or outside influences affecting the result.
Factors that influence accuracy
Time lapses between measurements
Act of measurement
Carryover Effects
measurement processes that alter what is measured
when a person’s experience in one condition or test affects their performance in the next condition or test.
Practice Effects
test itself provides an opportunity to learn and practice the ability being measured (increase of score due to test taker)
when a person’s test performance improves because they have already taken the test before.
Test Sophistication
increase of score due to the test
person’s familiarity, experience, or skill in taking tests.
being “good at taking tests” because of experience with testing situations.
already knows:
how tests work,
how questions are usually asked,
and strategies for answering them.
test sophistication
Two students have the same intelligence level.
Student A rarely takes standardized tests.
Student B often takes entrance exams and online practice tests.
Student B may score higher because they:
know time-management strategies,
are comfortable with multiple-choice questions,
and feel less anxious.
Fatigue Effects
repeated testing reduces overall mental energy or motivation to perform on a test
when a person’s performance becomes worse because they are physically or mentally tired during testing.
reliable tests
give scores that closely approximate true scores
valid tests
give scores that closely approximate construct scores
Construct Score
person’s standing on a theoretical variable independent of any particular measurement
tells how much of a certain psychological trait a person has based on test results.
concept of reliability
X = T + E: observed score = true score + error, so total variance = true variance + error variance (σ² = σ²_true + σ²_error)
Variance
useful in describing sources of test score variability; the standard deviation squared
shows how spread out or different scores are from the average (mean).
whether scores are very similar to each other, or very spread apart.
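A quick Python illustration of variance as the squared standard deviation, using hypothetical scores:
```python
scores = [80, 85, 90, 95, 100]  # hypothetical test scores
mean = sum(scores) / len(scores)                               # 90.0
variance = sum((s - mean) ** 2 for s in scores) / len(scores)  # population variance
sd = variance ** 0.5                                           # standard deviation
print(variance, sd)  # 50.0 and about 7.07; note sd² = variance
```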
Bias
degree to which a measure predictably overestimates or underestimates a quantity
anything in a test, assessment, or situation that causes results to be unfair, inaccurate, or systematically favor one group over another.
Measurement Error
inherent uncertainty associated with any measurement, even after care has been taken to minimize preventable mistakes
when a test score is affected by factors unrelated to the person’s true ability or trait.
Ex. a ruler may be accurate in some areas but not all
A student normally performs well in math.
But during the test:
they are tired,
cannot concentrate,
and score lower than usual.
The lower score is partly caused by measurement error, not a loss of ability.
Error
refers to the component of the observed test score that does not have to do with the test taker's ability
anything that causes a test score to differ from a person’s true score.
anything that makes a test score imperfect or inaccurate.
Random Error
source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process
has no consistent pattern and can increase or decrease a score in any direction.
Affects precision
Also called “noise”
Ex. physical events that occur during a test
Systematic Error
source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured
Affects accuracy
Either consistently inflate scores or consistently deflate scores
consistent and repeatable error that affects test scores in the same direction (either making scores too high or too low).
random error
A student takes two similar tests on different days:
Day 1: feeling happy → scores higher
Day 2: feeling tired → scores lower
A participant is taking a memory test. During one session:
a loud noise suddenly occurs,
they lose focus for a few seconds,
and miss a few items.
On a multiple-choice test:
a person guesses several answers,
some guesses are correct by chance,
others are wrong.
systematic error
A math test uses word problems based on city experiences (e.g., subway systems, traffic in big cities).
Students from rural areas may consistently score lower—not because they are less skilled, but because the content is unfamiliar.
If instructions are unclear every time the test is given:
many participants misunderstand questions,
and consistently lose points.
A teacher believes certain students are “weak.” As a result:
they mark those students more strictly every time,
regardless of actual performance.
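A Python sketch contrasting the two error types, assuming a hypothetical true score of 75 and a hypothetical constant 5-point grading bias. Random error leaves the long-run mean accurate (it affects precision), while systematic error shifts it (it affects accuracy):
```python
import random

random.seed(1)
TRUE_SCORE = 75  # hypothetical true score

def mean(xs):
    return sum(xs) / len(xs)

# Random error: unpredictable noise that can push a score either way
noisy = [TRUE_SCORE + random.gauss(0, 5) for _ in range(10000)]

# Systematic error: the same noise plus a constant 5-point deduction (e.g., a strict grader)
biased = [TRUE_SCORE + random.gauss(0, 5) - 5 for _ in range(10000)]

print(f"random error only:    mean = {mean(noisy):.1f}")   # ~75: noise averages out
print(f"plus systematic bias: mean = {mean(biased):.1f}")  # ~70: consistently deflated
```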
Item Sampling
refers to variation among items within a test as well as to variation among items between tests
Also called content sampling
Extent to which a testtaker’s score is affected by the content sampled on a test and by the way the content is sampled
when a test only uses a small set of questions to represent a much larger skill or trait.
Test Administration
the testtaker’s reactions to conditions that occur during test administration are the source of one kind of error variance
giving, conducting, and managing a psychological test in a standardized way so that results are fair, accurate, and consistent.
how the test is given, under what conditions, and who follows the instructions.
Test Environment
room temperature, level of lighting, and amount of ventilation and noise
Testtaker Variables
emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medication
Examiner-Related Variables
examiner’s physical appearance and demeanor
Test Scoring and Interpretation
Technical glitches in computer scoring may contaminate data
Element of subjectivity in scoring
Test Scoring
where responses are turned into numerical values.
counting the results
test interpretation
understanding what those scores mean in terms of ability, traits, or behavior.
explaining what the results mean
test-retest
estimates the stability of a measure by correlating pairs of scores from the same people on 2 different administrations of the same test
Source of error is time sampling
Ideal time is 2-4 weeks
Appropriate when evaluating the reliability of a test that purports to measure something that is relatively stable over time (ex. Personality trait)
↑ interval between tests = ↓ correlation / reliability
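A minimal Python sketch of the test-retest method: compute the Pearson correlation between scores from two administrations. The scores and the two-week interval are hypothetical:
```python
# Scores from the same five people on two administrations, two weeks apart
time1 = [82, 90, 75, 88, 95]
time2 = [80, 92, 73, 85, 96]

n = len(time1)
mean1, mean2 = sum(time1) / n, sum(time2) / n
cov = sum((x - mean1) * (y - mean2) for x, y in zip(time1, time2)) / n
sd1 = (sum((x - mean1) ** 2 for x in time1) / n) ** 0.5
sd2 = (sum((y - mean2) ** 2 for y in time2) / n) ** 0.5

r = cov / (sd1 * sd2)  # Pearson r serves as the test-retest reliability estimate
print(f"test-retest r = {r:.2f}")
```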
test designed for various uses
Coefficient of Stability
estimate of test-retest reliability when the interval between testing is > 6 months
how stable or consistent a test score is across time.
alternate forms
evaluates the correlation between 2 different forms of a test
Coefficient of Equivalence
estimate of alternate-forms or parallel-forms reliability
a measure of how similar scores are when a person takes two different but equivalent versions of the same test.
if two versions of a test give similar scores for the same people.
Parallel Forms Reliability
for each form of the test, the means and the variances of observed test scores are equal
Means of scores obtained correlate equally with the true score
Alternate Forms Reliability
different versions of a test that have been constructed so as to be parallel
Typically designed to be equivalent with respect to variables such as content and level of difficulty
Can be time-consuming and expensive
Ex. Army Alpha and Army Beta
Immediate Form
administered at the same time
Delayed Form
interval between both administrations
internal consistency
test designed for single administration only
inter-scorer
degree of agreement or consistency between two or more scorers with regard to a particular measure
Also called scorer reliability, judge reliability, observer reliability, and inter-rater reliability
Often used when coding nonverbal behavior
Coefficient of Inter-Scorer Reliability
degree of consistency among scorers in the scoring of a test
Kappa Statistic
used for nominal data
used to check how much agreement exists between two raters or judges, beyond what would happen just by chance.
Cohen’s Kappa
used to measure the level of agreement between two raters or judges only
Fleiss Kappa
determine the level of agreement between two or more raters
Kendall’s W
used for rankings / ordinal data in interrater reliability
Perfect Agreement
in COHEN’S KAPPA RANGES, 1.0
Near Perfect Agreement
in COHEN’S KAPPA RANGES, 0.81 - 0.99
Substantial Agreement
in COHEN’S KAPPA RANGES, 0.61 - 0.80
Moderate Agreement
in COHEN’S KAPPA RANGES, 0.41 - 0.60
Fair Agreement
in COHEN’S KAPPA RANGES, 0.21 - 0.40
Slight Agreement
in COHEN’S KAPPA RANGES, 0.10 - 0.20
No Agreement
in COHEN’S KAPPA RANGES, 0.0
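A minimal Python sketch of Cohen’s kappa for two raters on hypothetical dichotomous ratings: observed agreement corrected for the agreement expected by chance:
```python
# Two raters each classify the same 10 responses as "correct" (1) or "incorrect" (0)
rater_a = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
rater_b = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]

n = len(rater_a)
observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # observed agreement

# Chance agreement, from each rater's marginal proportions
p_a1, p_b1 = sum(rater_a) / n, sum(rater_b) / n
expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)

kappa = (observed - expected) / (1 - expected)
print(f"Cohen's kappa = {kappa:.2f}")  # ~0.58 here: moderate agreement per the ranges above
```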
Split-Half Reliability
method of internal consistency that correlates 2 pairs of scores obtained from equivalent halves of a single test administered once
Appropriate when evaluating psychological variables that are more state-like than trait-like
Odd-Even Reliability
assigning odd-numbered items to one half of the test and even-numbered items to the other half
Spearman–Brown Formula
used to estimate internal consistency reliability from a correlation between two halves of a test
Can also be used to estimate the effect of shortening the test on the test’s reliability
↑ length = ↑ reliability
could also be used to determine the number of items needed to attain a desired level of reliability
can help estimate how many items will have to be added in order to bring a test to an acceptable level of reliability
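A short Python sketch of the Spearman-Brown formula, using hypothetical reliability values, both to step a half-test correlation up to a whole-test estimate and to estimate how much a test must be lengthened to hit a target reliability:
```python
def spearman_brown(r, n):
    """Predicted reliability when test length is changed by a factor of n."""
    return (n * r) / (1 + (n - 1) * r)

# Half-test correlation of .70 -> estimated whole-test reliability (n = 2)
r_half = 0.70
print(f"whole-test reliability = {spearman_brown(r_half, 2):.2f}")  # ~0.82

# How much longer must a 20-item test with r = .60 be to reach r = .80?
r_old, r_target = 0.60, 0.80
n = (r_target * (1 - r_old)) / (r_old * (1 - r_target))  # formula solved for n
print(f"lengthen by factor {n:.2f} -> about {round(20 * n)} items")
```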
Coefficient Alpha
Also called cronbach’s alpha
Measures non-dichotomous items
May range in value from 0 to 1 only
Helps answer questions about how similar sets of data are
Accurately measures internal consistency when item factor loadings are equal (ex. Likert scale items)
KR-20
dichotomous items with varying levels of difficulty
where test items are highly homogeneous, KR-20 and split-half reliability estimates will be similar
KR-21
dichotomous items with uniform level of difficulty
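A minimal Python sketch of coefficient alpha, using hypothetical Likert-scale responses. For dichotomous (0/1) items, the same variance-based computation corresponds to KR-20, since each item variance reduces to p·q:
```python
def cronbach_alpha(items):
    """items: one list of scores per item, same people in the same order."""
    k = len(items)          # number of items
    n = len(items[0])       # number of people

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    totals = [sum(item[i] for item in items) for i in range(n)]  # total score per person
    item_var_sum = sum(var(item) for item in items)
    return (k / (k - 1)) * (1 - item_var_sum / var(totals))

# Hypothetical 5-point Likert responses: 3 items answered by 4 people
items = [
    [4, 5, 3, 4],
    [3, 5, 2, 4],
    [4, 4, 3, 5],
]
print(f"alpha = {cronbach_alpha(items):.2f}")  # ~0.86: good consistency per the ranges below
```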
Average Proportional Distance
measure used to evaluate internal consistency of a test that focuses on the degree of differences that exists between item scores
Not connected to the number of items on a measure
way of measuring how much scores differ from each other on average, in proportion to a reference value (usually the mean or total score).
how far scores are from each other
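A sketch of one common APD computation, assuming the procedure of averaging the absolute differences across all item pairs and dividing by the number of response options minus 1 (treat the exact steps as an assumption), with hypothetical responses:
```python
from itertools import combinations

# One person's responses to five items on a 5-option Likert scale (hypothetical)
responses = [4, 5, 3, 4, 5]
k_options = 5

pairs = list(combinations(responses, 2))
avg_diff = sum(abs(a - b) for a, b in pairs) / len(pairs)

apd = avg_diff / (k_options - 1)  # proportion of the maximum possible distance
print(f"APD = {apd:.2f}")  # lower values suggest more consistent responding
```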
Spearman-Brown Formula (half-test)
r_SB = 2r_hh / (1 + r_hh), where r_hh is the correlation between the two half-tests
Spearman-Brown Formula (whole test)
r_SB = n·r / (1 + (n − 1)·r), where n is the factor by which the test length is changed
Coefficient Alpha
α = (k / (k − 1)) · (1 − Σσᵢ² / σ²_total), where k is the number of items
Excellent Consistency
𝜶 ≥ 0.9
Good Consistency
0.9 > 𝜶 ≥ 0.8
Acceptable Consistency
0.8 > 𝜶 ≥ 0.7
Questionable Consistency
0.7 > 𝜶 ≥ 0.6
Poor Consistency
0.6 > 𝜶 ≥ 0.5
Unacceptable
0.5 > 𝜶
Homogeneous
uniform items; measures only 1 factor; high internal consistency
↑ internal consistency ≠ homogeneity (high internal consistency does not guarantee the test measures a single factor)
Items must be positively correlated
Heterogeneous
measures > 1 factor, low in internal consistency
Dynamic
a trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experiences (ex. anxiety)
Static
a trait, state, or ability presumed to be relatively unchanging (ex. intelligence)
Restriction of Range
when the variance of either variable in a correlational analysis is restricted by the sampling procedure, the resulting correlation coefficient tends to be lower
Inflation of Range
when the variance of either variable in a correlational analysis is inflated by the sampling procedure, the resulting correlation coefficient tends to be higher
Speed Tests
contain items of a uniform level of difficulty, administered within a time limit
Should be based on performance from two independent testing periods using one of the following
Test-retest reliability
Alternate-forms reliability,
Split-half reliability from two separately timed half tests
If a split-half procedure is used, then the obtained reliability coefficient is for a half test and should be adjusted using the Spearman–Brown formula
Power Tests
difficult items, time limit is long enough to allow test takers to attempt all items
Criterion-Referenced Tests
designed to provide an indication of where a testtaker stands with respect to some variable or criterion
Classical Test Theory (CTT)
true score model of measurement
how test scores are made up of true score + error.
every observed test score is not perfect, but a combination of:
the person’s real ability, and
random measurement error.
In favor of longer tests
Most widely used and accepted model in the psychometric literature
Much simpler to understand than IRT
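A short Python simulation of the CTT decomposition X = T + E, showing reliability as the ratio of true variance to total variance. The standard deviations (15 for true scores, 5 for error) are hypothetical:
```python
import random

random.seed(2)

# Simulate X = T + E for 1,000 test takers
true_scores = [random.gauss(100, 15) for _ in range(1000)]  # T
observed = [t + random.gauss(0, 5) for t in true_scores]    # X = T + E

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Reliability = true variance / total variance; here about 15² / (15² + 5²) = 0.9
print(f"reliability = {var(true_scores) / var(observed):.2f}")
```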