integ: psych assessment (realibilty)

0.0(0)
Studied by 0 people
call kaiCall Kai
learnLearn
examPractice Test
spaced repetitionSpaced Repetition
heart puzzleMatch
flashcardsFlashcards
GameKnowt Play
Card Sorting

1/119

encourage image

There's no tags or description

Looks like no tags are added yet.

Last updated 9:33 AM on 5/12/26
Name
Mastery
Learn
Test
Matching
Spaced
Call with Kai

No analytics yet

Send a link to your students to track their progress

120 Terms

1
New cards

Reliability

  • consistency in results regardless of how many or who takes the assessment

  • Synonym for dependability or consistency

  • Refers to something that produces similar results

  • proportion of total variance attributed to true variance

  • Precedes validity; without this a test cannot be valid

2
New cards
3
New cards

Rasch Model

  • each item on the test is assumed to have an equivalent relationship with the construct being measured by the test

  • checks whether a test question matches a person’s ability level.

  • ex: Student Ability

    • Anna = very skilled in math

    • Ben = average

    • Carla = beginner

  • Question Difficulty

    • Question 1 = easy

    • Question 10 = difficult

  • According to the Rasch Model:

    • Anna will likely answer both easy and difficult questions correctly

    • Carla may only answer easy questions correctly

4
New cards

Observer differences

  • when different observers record the same behavior

  • even though they have the same instructions, different judges observing the same event may record different numbers

  • assessed using kappa statistic

5
New cards

Time sampling

  • same test given at different points in time may produce different scores, even if given to the same test takers

  • assessed using test-retest method

6
New cards

transient error

source of error attributable to variations in the testtaker’s feelings, moods, or mental state over time

7
New cards

Methodological Error

  • when mistakes come from the way the test or research is designed and conducted.

  • examples are interviewers not being trained properly

  • wording in the questionnaire may have been ambiguous

  • items may have somehow been biased to favor one or another

8
New cards

error variance

  • variance from irrelevant, random sources

  • caused by mistakes, random factors, or outside influences — not the person’s real ability or trait.

  • The student is actually good at math, but:

    • the room is noisy

    • they did not sleep well
      -they felt nervous

9
New cards

true variance

  • variance from true differences of the test takers on the construct being measured

  • reflects a person’s real ability, trait, or characteristic — not mistakes or random factors.

  • ex: Student A is actually better in math than Student B. Their different scores reflect real ability.

10
New cards

Reliability Coefficient

statistic that quantifies reliability, ranging from 0 to 1

11
New cards

reliable

The greater the proportion of the total variance attributed to true variance, the more what the test

12
New cards

Perfect Reliability

  • 1.0

  • may indicate redundancy / homogeneity

13
New cards

Excellent Reliability

  • ≥ 0.9

  • minimum for clinical settings

14
New cards

Charles Spearman

  • pioneered reliability assessment.

  • Worked out most of the basics of contemporary reliability theory and published his work in a 1904 article entitled “The Proof and Measurement of Association between Two Things.”

15
New cards

Abraham De Moivre

basic notion of sampling error

16
New cards

Karl Pearson

developed the product moment correlation

17
New cards

Good Reliability

≥ 0.8 < 0.9

18
New cards

Acceptable Reliability

  • ≥ 0.7 < 0.8

  • minimum for psychometric tests

19
New cards

Questionable Reliability

  • ≥ 0.6 < 0.7

  • acceptable for research

20
New cards

Poor Reliability

≥ 0.5 < 0.6

21
New cards

Unacceptable Reliability

< 0.5

22
New cards

No Reliability

0.0

23
New cards

Goals of Reliability

  • Estimate errors (anything unaccounted for) in psychological measurement

  • Devise techniques to improve testing so errors are reduced

24
New cards

reliability

knowt flashcard image
25
New cards

True Score

  • measurement of a quantity if there were no measurement error at all

  • Can never be observed directly

  • It’s approximate can be identified by averaging measurements

  • Tied to the measurement instrument used

  • Long-term average of many measurements free of carryover effects

26
New cards

true score

  • person’s real score or actual level on a trait or ability, free from measurement error.

  • without any mistakes or outside influences affecting the result.

27
New cards

Factors that influence accuracy

  • Time lapses between measurements

  • Act of measurement

28
New cards

Carryover Effects

  • measurement processes that alter what is measured

  • when a person’s experience in one condition or test affects their performance in the next condition or test.

29
New cards

Practice Effects

  • test itself provides an opportunity to learn and practice the ability being measured (increase of score due to test taker)

  • when a person’s test performance improves because they have already taken the test before.

30
New cards

Test Sophistication

  • increase of score due to the test

  • person’s familiarity, experience, or skill in taking tests.

  • being “good at taking tests” because of experience with testing situations.

  • already knows:

    • how tests work,

    • how questions are usually asked,

    • and strategies for answering them.

31
New cards

test sophistication

Two students have the same intelligence level.

  • Student A rarely takes standardized tests.

  • Student B often takes entrance exams and online practice tests.

Student B may score higher because they:

  • know time-management strategies,

  • are comfortable with multiple-choice questions,

  • and feel less anxious.

32
New cards

Fatigue Effects

  • repeated testing reduces overall mental energy or motivation to perform on a test

  • when a person’s performance becomes worse because they are physically or mentally tired during testing.

33
New cards

reliable tests

give scores that closely approximate true scores

34
New cards

valid tests

give scores that closely approximate construct scores

35
New cards

Construct Score

person’s standing on a theoretical variable independent of any particular measurement

tells how much of a certain psychological trait a person has based on test results.

36
New cards

concept of reliability

knowt flashcard image
37
New cards

Variance

  • useful in describing sources of test score variability; the standard deviation squared

  • shows how spread out or different scores are from the average (mean).

  • whether scores are very similar to each other, or very spread apart.

38
New cards

Bias

  • degree to which a measure predictably overestimates or underestimates a quantity

  • anything in a test, assessment, or situation that causes results to be unfair, inaccurate, or systematically favor one group over another.

39
New cards

Measurement Error

  • inherent uncertainty associated with any measurement, even after care has been taken to minimize preventable mistakes

  • when a test score is affected by factors unrelated to the person’s true ability or trait.

  • Ex. a ruler may be accurate in some areas but not all

  • A student normally performs well in math.

  • But during the test:

    • they are tired,

    • cannot concentrate,

    • and score lower than usual.

  • The lower score is partly caused by

40
New cards

Error

  • refers to the component of the observed test score that does not have to do with the test taker's ability

  • anything that causes a test score to differ from a person’s true score.

  • anything that makes a test score imperfect or inaccurate.

41
New cards

Random Error

  • source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process

  • has no consistent pattern and can increase or decrease a score in any direction.

  •  Affects precision

  • Also called “noise:”

  • Ex. physical events that occur during a test

42
New cards

Systematic Error

  • source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured

  • Affects accuracy

  • Either consistently inflate scores or consistently deflate scores

  • consistent and repeatable error that affects test scores in the same direction (either making scores too high or too low).

43
New cards

random error

  • A student takes two similar tests on different days:

    • Day 1: feeling happy → scores higher

    • Day 2: feeling tired → scores lower

  • A participant is taking a memory test. During one session:

    • a loud noise suddenly occurs,

    • they lose focus for a few seconds,

    • and miss a few items.

  • On a multiple-choice test:

    • a person guesses several answers,

    • some guesses are correct by chance,

    • others are wrong.

44
New cards

systematic error

  • A math test uses word problems based on city experiences (e.g., subway systems, traffic in big cities).

    • Students from rural areas may consistently score lower—not because they are less skilled, but because the content is unfamiliar.

  • If instructions are unclear every time the test is given:

    • many participants misunderstand questions,

    • and consistently lose points.

  • A teacher believes certain students are “weak.”: As a result:

    • they mark those students more strictly every time,

    • regardless of actual performance.

45
New cards

Item Sampling

  • refer to variation among items within a test as well as to variation among items between tests 

  • Also called content sampling

  • Extent to which a testtaker’s score is affected by the content sampled on a test and by the way the content is sampled

  • when a test only uses a small set of questions to represent a much larger skill or trait.

46
New cards

Test Administration

  • Testtaker’s reactions to error variance that occur during test administration are the source of one kind of error variance

  • giving, conducting, and managing a psychological test in a standardized way so that results are fair, accurate, and consistent.

  • how the test is given, under what conditions, and who follows the instructions.

47
New cards

Test Environment

room temperature, level of lighting, and amount of ventilation and noise

48
New cards

Testtaker Variables

emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medication

49
New cards

Examiner-Related Variables

examiner’s physical appearance and demeanor

50
New cards

Test Scoring and Interpretation

  • Technical glitches in computer scoring may contaminate data

  • Element of subjectivity in scoring

51
New cards

Test Scoring

  • where responses are turned into numerical values.

  • counting the results

52
New cards

test interpretation

  • understanding what those scores mean in terms of ability, traits, or behavior.

  • explaining what the results mean

53
New cards

test-retest

  • stability of a measure by correlating pairs of scores from the same people on 2 different administrations

  • Source of error is time sampling

  • Ideal time is 2-4 weeks

  • Appropriate when evaluating the reliability of a test that purports to measure something that is relatively stable over time (ex. Personality trait)

  • ↑ interval between tests = ↓ correlation / reliability

  • test designed for various use

<ul><li><p><span style="background-color: transparent;">stability of a measure by correlating pairs of scores from the same people on 2 different administrations</span></p></li><li><p><span style="background-color: transparent;">Source of error is time sampling</span></p></li><li><p><span style="background-color: transparent;">Ideal time is 2-4 weeks</span></p></li><li><p><span style="background-color: transparent;">Appropriate when evaluating the reliability of a test that purports to measure something that is relatively stable over time (ex. Personality trait)</span></p></li><li><p><span style="background-color: transparent;">↑ interval between tests = ↓ correlation / reliability</span></p></li><li><p><span style="background-color: transparent;">test designed for <strong>various use</strong></span></p></li></ul><p></p>
54
New cards

Coefficient of Stability

  • estimate of test-retest reliability when the interval between testing is > 6 months

  • how stable or consistent a test score is across time.

55
New cards

alternate forms

evaluates the correlation between 2 different forms of a test

<p><span style="background-color: transparent;">evaluates the correlation between 2 different forms of a test</span></p>
56
New cards

Coefficient of Equivalence

  • estimate of alternate-forms or parallel-forms reliability

  • a measure of how similar scores are when a person takes two different but equivalent versions of the same test.

  • if two versions of a test give similar scores for the same people.

57
New cards

Parallel Forms Reliability

  • for each form of the test, the means and the variances of observed test scores are equal

  • Means of scores obtained correlate equally with the true score

58
New cards

Alternate Forms Reliability

  • different versions of a test that have been constructed so as to be parallel

  • Typically designed to be equivalent with respect to variables such as content and level of difficulty

  • Can be time-consuming and expensive

  • Ex. Army Alpha and Army Beta

59
New cards

Immediate Form

administered at the same time

60
New cards

Delayed Form

interval between both administrations

61
New cards

internal consistency

test designed for single administration onl

<p><span style="background-color: transparent;">test designed for <strong>single administration</strong> onl</span></p>
62
New cards

inter-scorer

  • degree of agreement or consistency between two or more scorers with regard to a particular measure

  • Also called scorer reliability, judge reliability, observer reliability, and inter-scorer reliability

  • Often used when coding nonverbal behavior

<ul><li><p><span style="background-color: transparent;">degree of agreement or consistency between two or more scorers with regard to a particular measure</span></p></li><li><p><span style="background-color: transparent;">Also called scorer reliability, judge reliability, observer reliability, and inter-scorer reliability</span></p></li><li><p><span style="background-color: transparent;">Often used when coding nonverbal behavior</span></p></li></ul><p></p>
63
New cards

Coefficient of Inter-Scorer Reliability

degree of consistency among scorers in the scoring of a test

64
New cards

Kappa Statistics

  • used for nominal data

  • used to check how much agreement exists between two raters or judges, beyond what would happen just by chance.

65
New cards

Cohen’s Kappa

used to measure the level of agreement between two raters or judges only

66
New cards

Fleiss Kappa

determine the level of agreement between two or more raters

67
New cards

Kendall’s W

used for rankings / ordinal data in interrater reliability

68
New cards

Perfect Agreement

in COHEN’S KAPPA RANGES, 1.0

69
New cards

Near Perfect Agreement

in COHEN’S KAPPA RANGES, 0.81 - 0.99

70
New cards

Substantial Agreement

in COHEN’S KAPPA RANGES, 0.61 - 0.80

71
New cards

Moderate Agreement

in COHEN’S KAPPA RANGES, 0.41 - 0.60

72
New cards

Fair Agreement

in COHEN’S KAPPA RANGES, 0.21 - 0.40

73
New cards

Slight Agreement

in COHEN’S KAPPA RANGES, 0.10 - 0.20

74
New cards

No Agreement

in COHEN’S KAPPA RANGES, 0.0

75
New cards

Split-Half Reliability

  • method of internal consistency that correlates 2 pairs of scores obtained from equivalent halves of a single test administered once

  • Appropriate when evaluating psychological variables that are more state-like than trait-like

76
New cards

Odd-Even Reliability

assigning odd-numbered items to one half of the test and even-numbered items to the other half

77
New cards

Spearman–Brown Formula

  • used to estimate internal consistency reliability from a correlation between two halves of a test

  • Can also be used to estimate the effect of shortening the test on the test’s reliability

  • ↑ length = ↑ reliability 

  • could also be used to determine the number of items needed to attain a desired level of reliability

  • can help estimate how many items will have to be added in order to bring a test to an acceptable level of reliability

78
New cards

Coefficient Alpha

  • Also called cronbach’s alpha

  • Measure non-dichotomous items

  • May range in value from 0 to 1 only

  • Helps answer questions about how similar sets of data are

  • Accurately measures internal consistency when multiple loadings are equal (ex. Likert scale)

79
New cards

KR-20

  • dichotomous items with varying levels of difficulty

  • where test items are highly homogeneous, KR-20 and split-half reliability estimates will be similar

80
New cards

KR-21

dichotomous items with uniform level of difficulty

81
New cards

Average Proportional Distance

  • measure used to evaluate internal consistency of a test that focuses on the degree of differences that exists between item scores

  • Not connected to the number of items on a measure

  • way of measuring how much scores differ from each other on average, in proportion to a reference value (usually the mean or total score).

  • how far scores are from each other

82
New cards

spearman-brown formula (half-test)

knowt flashcard image
83
New cards

Spearman-Brown Formula (whole test)

knowt flashcard image
84
New cards

Coefficient Alpha

knowt flashcard image
85
New cards

Excellent Consistency

𝜶 ≥ 0.9

86
New cards

Good Consistency

0.9 > 𝜶 ≥ 0.8

87
New cards

Acceptable Consistency

0.8 > 𝜶 ≥ 0.7

88
New cards

Questionable Consistency

0.7 > 𝜶 ≥ 0.6

89
New cards

Poor Consistency

0.6 > 𝜶 ≥ 0.5

90
New cards

Unacceptable

0.5 > 𝜶

91
New cards

Homogenous

  • uniform items; measures only 1 factor; high internal consistency

  • ↑ internal consistency ≠ homogeneity

  • Items must be positively correlated

92
New cards

Heterogenous

measures > 1 factor, low in internal consistency

93
New cards

Dynamic

a trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experiences (ex. anxiety)

94
New cards

Static

a trait, state, or ability presumed to be relatively unchanging (ex. intelligence

95
New cards

Restriction

resulting correlation coefficient tends to be lower

96
New cards

Inflation

resulting correlation coefficient tends to be higher

97
New cards

Speed Tests

  • contains items of uniform level of difficulty and within a time limit

  • Should be based on performance from two independent testing periods using one of the following

    • Test-retest reliability

    • Alternate-forms reliability,

    • Split-half reliability from two separately timed half tests

    • If a split-half procedure is used, then the obtained reliability coefficient is for a half test and should be adjusted using the Spearman–Brown formula

98
New cards

Power Tests

difficult items, time limit is long enough to allow test takers to attempt all items

99
New cards

Criterion-Referenced Tests

designed to provide an indication of where a testtaker stands with respect to some variable or criterion

100
New cards

Classical Test Theory (CTT)

  • true score model of measurement

  • how test scores are made up of true score + error.

  • every observed test score is not perfect, but a combination of:

    • the person’s real ability, and

    • random measurement error.

  • In favor of longer tests

  • Most widely used and accepted model in the psychometric literature

  • Much simpler to understand than IRT

<ul><li><p><span style="background-color: transparent;">true score model of measurement</span></p></li><li><p>how test scores are made up of <strong>true score + error</strong>. </p></li><li><p>every observed test score is not perfect, but a combination of:</p><ul><li><p>the person’s real ability, and</p></li><li><p>random measurement error.</p></li></ul></li><li><p><span style="background-color: transparent;">In favor of longer tests</span></p></li><li><p><span style="background-color: transparent;">Most widely used and accepted model in the psychometric literature</span></p></li><li><p><span style="background-color: transparent;">Much simpler to understand than IRT</span></p></li></ul><p></p>