Psych Tests Exam #2 (Chapters 5–7)

Last updated 3:34 AM on 10/29/24

109 Terms

1

what is reliability?

consistency of measurement

2

what is a reliability coefficient?

an index of reliability: a proportion indicating the ratio of the true score variance on a test to the total variance
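In symbols (the standard classical test theory identity, added here for reference):

$$ r_{xx} = \frac{\sigma^2_T}{\sigma^2_X} $$

where \( \sigma^2_T \) is true-score variance and \( \sigma^2_X \) is total observed-score variance.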

3

what are the components of an observed score?

true score + measurement error (X= T + E)

4

error

refers to the component of the observed score that does not reflect the testtaker's true ability or the trait being measured

5

variance

standard deviation squared

6

variance equals

true variance plus error variance
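In symbols (the standard CTT decomposition, added for reference):

$$ \sigma^2_X = \sigma^2_T + \sigma^2_E $$

so reliability can equivalently be written as \( r_{xx} = 1 - \sigma^2_E / \sigma^2_X \).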

7

measurement error

all of the factors associated with the process of measuring some variable, other than the variable being measured.

8

What is random error?

a source of error in measuring a targeted variable caused by
UNPREDICTABLE FLUCTUATIONS and inconsistencies of other variables in the measurement process (i.e., noise)

9

What is systematic error?

a source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured.

10

Test Construction

Variation may exist within items on a test or between tests (i.e., item sampling or content sampling).

11

Test Administration

Sources of error may stem from the testing environment.

12

Testtaker Variables

pressing emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medication

13

Examiner-related variables

physical appearance, demeanor, eye contact

14

Test Scoring and Interpretation

Computer testing reduces errors in test scoring, but many tests still require expert interpretation (e.g., projective tests).

15

Sampling Error

the extent to which the sample in the study actually is
representative of the population

16

Methodological error

inadequate training (e.g., of interviewers), ambiguous wording in the questionnaire, biased framing of questions

17

Test-retest reliability

a method for determining the reliability of a test by comparing a test taker's scores on the same test taken on separate occasions (same person takes the test twice)

18

What kind of variable is the test-retest appropriate for? inappropriate for what kind of variable?

personality (stable); mood (changes)

19

Coefficient of Stability

the estimate of test-retest reliability when the interval between testings is longer than six months

20

coefficient of equivalence

The degree of the relationship between various forms of a test.

21

parallel forms reliability

for each form of the test, the means and the variances of observed test scores are equal

22

alternate forms reliability

different versions of a test that have been constructed so as to be parallel.

Do not meet the strict requirements of parallel forms, but typically item content and difficulty are similar between tests

23

How is reliability checked?

By administering two forms of a test to the same group. Scores may be affected by error related to the state of testtakers (e.g., practice, fatigue, etc.) or item sampling

24

split-half reliability

is obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once.

25

3 steps of split half reliability

1. Divide the test into equivalent halves
2. Calculate a Pearson r between scores on the two halves of the test
3. Adjust the half-test reliability using the Spearman-Brown formula

26

spearman-brown formula

allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test.
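The formula itself (standard form; n is the factor by which the test is lengthened, so n = 2 for split halves) is \( r_{SB} = \frac{n\,r}{1 + (n-1)\,r} \). A minimal sketch in Python, assuming you already have the correlation between the two halves (function name is illustrative):

```python
def spearman_brown(r_half: float, n: float = 2.0) -> float:
    """Estimate whole-test reliability from a part-test correlation.

    r_half: Pearson r between the two halves (or the shorter form).
    n: factor by which the test is lengthened (2 for split-half).
    """
    return (n * r_half) / (1 + (n - 1) * r_half)

# Halves correlating at .70 imply full-test reliability of about .82
print(round(spearman_brown(0.70), 2))
```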

27

inter-item consistency

The degree of relatedness of items within a test. Able to gauge the HOMOGENEITY of a test

28

Kuder-Richardson formula 20

statistic of choice for determining the inter-item consistency of DICHOTOMOUS ITEMS
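For reference, the standard formula (not printed on the card):

$$ KR_{20} = \frac{k}{k-1}\left(1 - \frac{\sum_{j=1}^{k} p_j q_j}{\sigma^2_X}\right) $$

where k is the number of items, \( p_j \) is the proportion passing item j, \( q_j = 1 - p_j \), and \( \sigma^2_X \) is the variance of total scores.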

29

coefficient alpha

mean of all possible split-half correlations, corrected by the Spearman-Brown formula. The most popular approach for internal consistency.
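In formula form, \( \alpha = \frac{k}{k-1}\left(1 - \frac{\sum \sigma^2_i}{\sigma^2_X}\right) \), where the \( \sigma^2_i \) are item variances and \( \sigma^2_X \) is the total-score variance. A minimal computational sketch in Python (function name and toy data are illustrative):

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: 2-D array with rows = respondents, columns = items."""
    k = scores.shape[1]                         # number of items
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Toy data: 4 respondents x 3 items on a 1-5 scale
data = np.array([[2, 3, 3],
                 [4, 4, 5],
                 [1, 2, 2],
                 [3, 3, 4]])
print(round(cronbach_alpha(data), 2))
```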

30

what are the coefficient alpha values?

Values range from 0 to 1

31

inter-scorer reliability

The degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure

-It is often used with behavioral measures

-Guards against biases or idiosyncrasies in scoring

32

coefficient of inter-scorer reliability

The scores from different raters are correlated with one another.
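A minimal sketch of computing it in Python, assuming two raters scored the same six cases (the data are illustrative):

```python
import numpy as np

rater_a = np.array([4, 3, 5, 2, 4, 1])  # rater A's scores
rater_b = np.array([5, 3, 4, 2, 4, 2])  # rater B's scores on the same cases

# Pearson r between raters serves as the inter-scorer reliability coefficient
r = np.corrcoef(rater_a, rater_b)[0, 1]
print(round(r, 2))
```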

33

the more __________ a test is (i.e., the more similar to one another the items you are creating are), THE HIGHER THE RELIABILITY IS

homogeneous

34

How quickly the characteristic being measured changes (i.e., how static it is) will affect the ….

TEST-RETEST RELIABILITY 

35

true score

a value that, according to classical test theory, genuinely reflects an individual’s ability (or trait) level as measured by a particular test.

36

true-score model is often referred to as

Classical Test Theory (CTT)—Perhaps the most widely used model due to its simplicity.

37

Generalizability theory

a person’s test scores may vary because of variables in the testing situation and sampling.

38

item-response theory

Provides a way to model the probability that a person with X ability will be able to perform at a level of Y.

39

What is standard error of measurement?

often abbreviated as SEM, provides a measure of the PRECISION of an OBSERVED TEST SCORE. An estimate of the amount of error inherent in an observed score or measurement.
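The usual computing formula (standard in classical test theory, added for reference):

$$ SEM = \sigma\sqrt{1 - r_{xx}} $$

where \( \sigma \) is the standard deviation of test scores and \( r_{xx} \) is the test's reliability. For example, with \( \sigma = 10 \) and \( r_{xx} = .84 \), \( SEM = 10\sqrt{.16} = 4 \).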

40

do we want small or large standard error of measurement?

Small

41

the higher the reliability of the test…

the lower the standard error

42

The standard error can be used to estimate the extent to which an observed score

DEVIATES from a true score

43

confidence interval

a range or band of test scores that is likely to contain the true score

44

Confidence level of 95%

corresponds to a z-value of approximately 2 (more precisely, 1.96)

45

What are some cultural considerations in test construction/standardization?

  • Check how appropriate they are for use with the targeted test taker population

  • When interpreting test results it helps to know about the culture and era of the test-taker

  • It is important to conduct a culturally-informed assessment 

46

Describe the difference between random error and systematic error

Random error is unpredictable and varies each time, causing results to go up or down by chance.

Systematic error is consistent and always skews results in the same direction.

Random error affects consistency, while systematic error affects accuracy.

47

When is it appropriate to use test-retest reliability?

Use when you want to assess the stability of a test over time. This index is most suitable when the characteristic being measured is expected to remain stable (e.g., intelligence, personality traits).

Example: A researcher administering the same personality test to the same group of people at two different times to see if the scores remain consistent.

48

When is it appropriate to use parallel or alternate forms reliability?

Use when you want to check the consistency between two different versions of the same test. It’s ideal when repeated testing could lead to practice effects (people getting better just because they remember the questions).

Example: Creating two different versions of a math test to ensure that both versions measure math skills equally well.

49

When is it appropriate to use split-half reliability?

Use when you want to measure the internal consistency of a test. The test is split into two halves, and the scores are compared. It’s suitable for tests where all items are meant to measure the same underlying construct.

Example: A long questionnaire is divided into odd and even items, and the correlation between these halves is assessed.

50

When is it appropriate to use inter-item consistency?

Use when you want to determine the degree to which items on a test measure the same construct. It’s suitable for tests or questionnaires with multiple items aiming to assess the same concept.

Example: Checking if all questions in a depression inventory are related to the same concept of depression.

51

When is it appropriate to use coefficient alpha (cronbach’s Alpha)?

Use when you want a measure of overall internal consistency for a test. This is suitable when you have a test or survey with multiple items, all aimed at measuring the same construct, and you need a single reliability estimate.

Example: Assessing the reliability of a 20-item scale that measures self-esteem.

52

When is it appropriate to use inter-scorer reliability?

Use when you want to determine the consistency of scores assigned by different raters or judges. This is crucial when the scoring involves subjective judgment.

Example: Multiple judges rating the creativity of children's artwork and checking the agreement among their scores.

53

How does homogeneity vs heterogeneity of test items impact reliability?

homogeneous items increase reliability because they reflect the same construct, leading to higher internal consistency. In contrast, heterogeneous items decrease reliability if measured by indices assuming unidimensionality, but they might still provide a comprehensive assessment of a broader concept.

54

What is the relation between the range of test scores and reliability?

a broader range of scores usually leads to higher reliability, while a restricted range reduces reliability estimates, potentially underestimating the test's true ability to measure what it’s supposed to.

55

What is the impact of a speed test or power test on reliability?

speed tests (complete as many questions as you can within a set time limit) and power tests (ample time allowed, but questions get harder as the test goes on) affect how reliability should be estimated: a single timed administration of a speed test yields SPURIOUSLY HIGH internal-consistency estimates (e.g., split-half)

therefore, reliability for speed tests should be estimated from two independent testings (test-retest, alternate forms, or two separately timed half-tests) rather than from one sitting

56

Assumptions of Classical Test Theory (CTT)

  • people will have a true score on the construct

  • every observed score contains some error

  • errors are not correlated with the true score (they are random)

  • errors are normally distributed

  • errors cancel themselves out

  • the more items, the greater the reliability

57

Pros and Cons of Classical Test Theory (CTT)

Pros:

  • the assumptions are easily met

  • it is simple and easier to understand

  • most psychological tests are developed using this

Cons:

  • assumes the items on a test are all equal in their ability to measure the construct

  • requires long measures (more items) to achieve high reliability

58

Assumptions of Item Response Theory (IRT)

  • Unidimensionality: assumes that a single latent trait or ability (e.g., math ability or reading comprehension) primarily drives responses to all items on a test.

  • Local Independence: if you know a person’s ability level, knowing their response to one item shouldn’t give additional information about their response to another.

  • Monotonicity: For each item, the probability of a correct response increases as the latent trait level increases. This means that higher levels of the trait should lead to a higher likelihood of getting the item right (or agreeing with it in the case of attitude measures).

  • Invariance of Item Parameters: Item parameters (like difficulty and discrimination) are assumed to be the same across different groups, meaning that items function similarly regardless of the specific sample of test-takers.

59

What is difficulty and discrimination in IRT?

Difficulty determines where on the ability scale the item is targeted.

Discrimination determines how sharply the item distinguishes between different ability levels.

A well-designed IRT-based test aims to have items with a range of difficulties and high discrimination to accurately measure individuals across the ability spectrum.
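A minimal sketch of a two-parameter logistic (2PL) IRT item in Python; the parameter names follow convention (a = discrimination, b = difficulty) and the values are illustrative:

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL model: probability of a correct response at ability level theta.

    a: discrimination (how sharply the item separates ability levels)
    b: difficulty (the ability level where P(correct) = .50)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# An item of average difficulty (b = 0) with high discrimination (a = 2)
for theta in (-1.0, 0.0, 1.0):
    print(theta, round(p_correct(theta, a=2.0, b=0.0), 2))
# Output rises from ~.12 through .50 to ~.88, illustrating monotonicity
```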

60

SAMPLE PROBLEM: calculate the confidence interval given a standard error of measurement (SEM) of 4 and a 95% confidence level (z-score of 2). The observed score is 78.

Observed score= 78, SEM= 4, z-score for 95%= 2:

78 ± (2×4) so 78 ± 8

The confidence interval is 70 to 86

61

SAMPLE PROBLEM #2 for CONFIDENCE INTERVAL: A standardized test has a standard error of measurement (SEM) of 3 points. If a student's observed score is 85, calculate the 95% confidence interval for the student's true score. Use a z-score of 2, which corresponds to a 95% confidence level.

Observed score= 85, SEM= 3, z-score for 95% confidence=2

85 ± (2×3) so 85 ± 6 which means the confidence interval is 79 to 91 so the student’s true score lies within that range
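Both problems follow the same recipe: observed score ± (z × SEM). A minimal sketch in Python (function name is illustrative):

```python
def confidence_interval(observed: float, sem: float, z: float = 2.0):
    """Return the (lower, upper) band likely to contain the true score."""
    margin = z * sem
    return observed - margin, observed + margin

print(confidence_interval(78, sem=4))  # (70.0, 86.0) -- sample problem 1
print(confidence_interval(85, sem=3))  # (79.0, 91.0) -- sample problem 2
```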

62

How to use confidence interval information to interpret test scores?

  • Understanding Accuracy: A narrow confidence interval indicates a more precise estimate of the true score, while a wide interval suggests less precision.

  • Comparing Scores: When comparing two students' test scores, overlapping confidence intervals may suggest that the difference in their scores isn't significant, while non-overlapping intervals imply a real difference.

  • Decision-Making: Confidence intervals can help decide whether a student's score meets a cutoff or threshold (e.g., passing/failing), since they take measurement error into account. For example, if the passing score is 80 and a student's CI is 78 to 84, their true score could fall either above or below the passing threshold.

63

What is the standard error of the difference and how does it help us with interpretation of two scores?

A measure that can aid a test user in determining how large a difference in test scores should be expected before it is considered statistically significant.

It helps us interpret two scores because we can see whether there was a real improvement or change. (If the scores are statistically different, their true-score confidence bands should not overlap.)
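The standard formula (from classical test theory, added for reference):

$$ \sigma_{\text{diff}} = \sqrt{\sigma^2_{\text{meas}_1} + \sigma^2_{\text{meas}_2}} $$

where the two terms under the radical are the squared standard errors of measurement of the two scores being compared.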

64

Three questions that standard error of difference can help us answer:

1. How did this individual’s performance on test 1 compare to their own performance on test 2?

2. How did this individual’s performance on test 1 compare with someone else’s performance on test 1?

3. How did this individual’s performance on test 1 compare with someone else’s performance on test 2?

65

What is validity?

 a judgment or estimate of how well a test measures what it is supposed to measure within a particular context

66

Relationship between reliability and validity

Reliability is about consistency, while validity is about accuracy. You can have a test that is reliable but not valid, but you can't have a valid test that isn't reliable.

67

Content validity

Evaluation of the subjects, topics, or content covered by the items in the test

Typically established by recruiting a team of experts on the subject matter and obtaining expert ratings on the degree of item importance, as well as scrutinizing what is missing from the measure

68

Criterion-related validity

Evaluating the relationship of scores obtained on the test to scores on other tests or measures

A criterion is the standard against which a test or a test score is evaluated

69

Construct validity

This is a measure of validity that is arrived at by executing a comprehensive analysis of:

  • how scores on the test relate to other test scores and measures

  • how scores can be interpreted within a theoretical framework that explains the construct the test was designed to measure

70

“Pyramid of validity” order

  1. Construct validity

  2. Criterion-related validity

  3. Content validity

71

Face validity

a judgment concerning how relevant the test items appear to be

  • If a test appears to measure what it is supposed to be measuring “on the face of it,” it could be said to be high in face validity

  • A perceived lack of face validity may contribute to a lack of confidence in the test

72

Content validity

How well a test samples behaviors that are representative of the broader set of behaviors it was designed to measure

  • Do the test items adequately represent the content that should be included in the test?

73

Characteristics of a Criterion

  • An adequate criterion is relevant for the matter at hand, valid for the purpose for which it is being used, and uncontaminated, meaning it is not part of the predictor.

74

What constitutes good face validity?

  • Appearance of Relevance: the items look like they are clearly related to the construct or skill being assessed.

    • For example, a math test should have questions that look mathematical (formulas, calculations) rather than unrelated content.

  • Clear Instructions and Questions: The questions are straightforward and understandable, with no ambiguity or confusion about what is being asked.

  • Transparency: The purpose of the test should be evident to the test-taker, which can make them more motivated and cooperative during the test.

75

Consequences of lacking face validity

  • Lack of Trust: test-takers may not trust the test or might question its relevance. This can lead to a lack of motivation or cooperation.

    • For example, if a personality test contains seemingly random or unrelated questions, respondents may doubt the test's validity.

  • Decreased Motivation: make individuals less engaged, potentially affecting how seriously they take the test, which could negatively impact the results.

  • Perceived Unfairness: Test-takers might feel that they are being judged based on irrelevant criteria, leading to feelings of unfairness and dissatisfaction.

76

Why would we not want a test to be face valid?

  • Reducing Social Desirability Bias: If a test is too face valid, test-takers might alter their responses to look better or to meet perceived expectations. This is common in personality or psychological tests, where people might respond in a socially desirable way if they understand exactly what’s being measured.

    • For example, if a test clearly appears to measure honesty, individuals might choose answers they think make them look honest rather than providing true responses.

  • Minimizing Faking: In some scenarios, especially in employment testing, we might not want candidates to know exactly what is being assessed to prevent them from faking good answers.

    • A job skills test with high face validity might lead candidates to prepare in ways that don't truly reflect their skills, affecting the test’s effectiveness.

  • Testing Hidden Constructs: In psychological assessments, certain constructs may be better assessed when individuals are unaware of what is being measured. This helps in capturing more genuine behavior or responses.

    • For instance, subtle tests of cognitive biases or implicit attitudes are less face valid by design to ensure more accurate responses.

77

concurrent validity

 an index of the degree to which a test score is related to some criterion measure obtained at the same time

78

predictive validity

an index of the degree to which a test score predicts some criterion or outcome measured in the future. Tests are evaluated as to their ______.

79

What is the difference between concurrent and predictive validity?

  • Concurrent validity tells you how well a test aligns with a current standard or established measure.

  • Predictive validity tells you how well a test predicts future outcomes.

Both forms of validity help in determining how useful a test is for different purposes, but they serve distinct roles based on the timing of the criterion being measured.

80

base rate (predictive validity)

the extent to which the phenomenon exists in the population

  • Example: 55% divorce base rate is a high base rate

81

hit rate (predictive validity)

accurate identification (True-positive & True-negative)

82

miss rate (predictive validity)

  • Failure to identify accurately

    • False-positive

    • False-negative

83

false positive

When a test incorrectly indicates the presence of a condition or attribute when it is actually absent. In other words, the test result is positive, but the true condition is negative.

84

false negative

When a test fails to detect a condition or attribute that is actually present. Here, the test result is negative, but the true condition is positive.

85

Type 1 error

AKA false positive. The null hypothesis (H₀) is rejected when it is actually true. The incorrect conclusion that there is an effect or difference when none exists.

Example: Positive COVID test but a person DOES NOT have COVID

86

Type 2 error

AKA False negative. When the null hypothesis (H₀) is not rejected when it is actually false. This means that a true effect or difference is missed, leading to the conclusion that there is no effect when there actually is one.

Example: Negative COVID test, but the person DOES have COVID

87

Validity coefficient

a correlation coefficient between test scores and scores on the criterion measure

  • Are affected by restriction or inflation of range.

88

Incremental validity

the degree to which an additional predictor explains something about the criterion measure that is not explained by predictors already in use

89

What happens to the validity coefficient when you restrict or inflate the range of scores?

  • Restricted Range: Lowers the validity coefficient, making the test seem less predictive than it might be with a full range of scores.

  • Inflated Range: Can increase the validity coefficient, possibly making the test seem more predictive if the range is artificially broadened.
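A small simulation makes the restriction effect concrete (an illustrative sketch, not from the card): correlate two related variables, then recompute the correlation keeping only the top half of the predictor scores.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5000)                  # predictor (e.g., test scores)
y = 0.6 * x + 0.8 * rng.normal(size=5000)  # criterion correlated with x (r ~= .6)

full_r = np.corrcoef(x, y)[0, 1]

keep = x > np.median(x)                    # restricted range: top half of x only
restricted_r = np.corrcoef(x[keep], y[keep])[0, 1]

print(round(full_r, 2), round(restricted_r, 2))  # restricted r comes out lower
```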

90

What is the importance of incremental validity

  • Increases prediction accuracy by identifying unique contributions.

  • Prevents unnecessary redundancy in measurements.

  • Helps allocate resources efficiently.

  • Contributes to a more comprehensive understanding of constructs.

  • Ensures that new measures genuinely enhance validity.

91

If a test has high construct validity, what does this tell you about the test?

The test is accurately and consistently measuring the intended concept or trait, aligns with theoretical expectations, and appropriately correlates with related measures while distinguishing itself from unrelated ones. It assures that the test's scores are meaningful indicators of the construct in question.

92

Evidence of homogeneity (Construct V)

 how uniform a test is in measuring a single concept

93

Evidence of changes (Construct V)

Some constructs are expected to change over time (e.g., reading rate)

  • How stable is it? Be able to show it with your measure

94

Evidence of pretest/post test changes (Construct V)

test scores change as a result of some experience between a pretest and a posttest (e.g., therapy)

  • Dynamic assessment: not just stability; with manipulation of the environment, etc., you would expect changes.

95

Evidence from distinct groups (Construct V)

Scores on a test vary in a predictable way as a function of membership in some group (e.g., scores on the Psychopathy Checklist for prisoners vs. civilians).

96

Convergent evidence (Construct V)

Correlates highly in the predicted direction with scores on previously psychometrically established tests designed to measure the same (or similar) constructs

97

Discriminant evidence (Construct V)

Showing little relationship between test scores and other variables with which scores on the test should not theoretically be correlated

98

Factor Analysis (Construct V)

A new test should load on a common factor with other tests of the same construct.

99

What is bias?

a factor inherent in a test that systematically prevents accurate, impartial measurement

100

What is fairness?

The extent to which a test is used in an impartial, just, and equitable way.