The science concerned with evaluating the attributes of psychological tests and the procedures used to estimate and evaluate those attributes. Involves reliability and validity.
2
New cards
Psychological construct
An unobservable hypothetical entity used to represent a pattern of psychologically related phenomena. Abstract in nature.
3
New cards
Tests or indicators used to represent a construct
Should be __linked theoretically__ to the definition of the construct.
4
New cards
Psychological test
A systematic procedure for comparing the behavior of two or more people.
5
New cards
Criteria of tests
1) Tests involve behavioural samples of some kind.
2) The behavioural samples must be collected in some systematic way.
3) The purpose of tests is to compare behaviours of two or more people.
6
New cards
Inter-individual differences
__Comparing__ the cognition/behaviour of different people.
7
New cards
Intra-individual differences
Measure changes in the behaviour of the same people across time.
8
New cards
Types of tests
1: Content (e.g., skill, personality, attitudes)
2: Types of responses (e.g., multiple-choice, open-ended)
3: Administration procedure (e.g., individual vs. group)
4: Intended purpose (criterion vs. norm referenced)
5: Time constraints (speed vs. power)
9
New cards
Criterion referenced tests
Involve a __cut-off score__ that specifies whether someone has achieved a sufficient level of the skill or capacity. Used in contexts where a decision must be made about someone’s __skill level__.
10
New cards
Norm referenced test
Involve comparing a person’s score with that of a __reference sample__. Reference sample is considered to be __representative__ of the population of interest. Compared against the average score.
11
New cards
Scaling
The way in which numerical values are assigned to psychological attributes.
12
New cards
Properties of Identity
Differentiates categories of people who do and do not share a psychological feature. Numbers may be used to __label the categories__ - they do __not have any inherent meaning__ or true mathematical value.
1. Categories must be mutually __exclusive__.
2. Categories should be exhaustive.
3. All people classified within a given category must be identical with respect to the attribute of interest.
13
New cards
Properties of Order
More informative than identity - give the person with the highest level of an attribute a rank of 1, or alternatively, a rank of N (i.e., where N equals the total sample size).
14
New cards
Properties of Quantity
Reflects the ability of numerals to provide information about the magnitude of differences between people. Numerals used in measurement are real numbers that indicate the amount of something. Can compare cases with each other in a meaningful way.
15
New cards
The number zero
Some are absolute, whereas others are relative (or arbitrary).
16
New cards
Additivity
Implies that the unit size of a measurement does not change as the units are being counted. When perfectly satisfied, this is known as __conjoint measurement__.
Rarely satisfied in psychology.
17
New cards
Counting
Is a necessary but not sufficient condition for measurement.
18
New cards
Measurement
The assignment of numerals to objects or events according to rules. Numbers can be __assigned__ to __represent__ the quantities of psychological attributes.
Includes Stevens’s four levels (nominal, ordinal, interval, ratio).
19
New cards
Nominal scales
Symbols or numerals that have a property of identity are used to label observations in which behaviors have been sorted into categories according to some psychological attribute.
Has identity.
20
New cards
Ordinal scales
Links observations of behaviour thought to reflect qualitative differences in amounts of an attribute to symbols or numerals that have the __property of order.__
Has identity and order.
21
New cards
Interval scales
A constant distance between each of the units. However, there is no meaningful zero point.
Has identity, order and quantity.
22
New cards
Ratio scales
Same properties as interval scales, but you can say that one case has, say, 37% more of the attribute of interest than another case.
Has identity, order, quantity and absolute zero.
23
New cards
Dispersion
Represents the amount of variability in the data. Includes range, SD and variance.
24
New cards
Correlation
Standardized representation of the __association__ between two variables
Can range from -1.0 to 1.0
25
New cards
Composite Variables
Test __scores__ based on the __sum__ of __two__ or more items. Also called sum scores.
26
New cards
The variance of a composite score is a function of
1. The variance associated with the individual items.
2. The correlations amongst the items.
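As an illustrative sketch (the variable names and simulated data are not from the source), the variance of a two-item composite equals the sum of the item variances plus twice their covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two correlated items (simulated scores for 1,000 respondents)
item1 = rng.normal(size=1000)
item2 = 0.5 * item1 + rng.normal(size=1000)

composite = item1 + item2
cov = np.cov(item1, item2, ddof=1)  # 2x2 sample covariance matrix

# Var(X1 + X2) = Var(X1) + Var(X2) + 2*Cov(X1, X2)
expected = cov[0, 0] + cov[1, 1] + 2 * cov[0, 1]
print(np.isclose(composite.var(ddof=1), expected))  # True
```

Because the items correlate positively, the composite variance exceeds the simple sum of the item variances.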
27
New cards
Composite variables - Variance
28
New cards
Binary Items
Dichotomous items - correct or incorrect answer.
Responses are scored 0 or 1.
29
New cards
Binary items - Variance
The variance of a dichotomously scored item is maximized when half of the people score 1 and the other half score 0.
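A quick numerical check of this claim (a sketch, not from the source): the variance of a 0/1 item is p(1 − p), which peaks when p = .5.

```python
import numpy as np

# Variance of a dichotomous item with proportion p scoring 1 is p*(1 - p)
p = np.linspace(0, 1, 101)
variance = p * (1 - p)

# The maximum variance (0.25) occurs at p = 0.5
print(p[variance.argmax()], variance.max())  # 0.5 0.25
```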
30
New cards
Relative interpretations
Based on the __analysis of data__.
1: Knowing the mean of the distribution of scores
2: Knowing the standard deviation of the distribution of scores
31
New cards
Abstract interpretations
Based on the __theoretically relevant characteristics__ of the body of research which supports the test scores as valid indicators of a psychological construct.
32
New cards
Limitations of raw scores
Raw scores need to be __converted__ into standardized scores, which incorporate information about the mean and standard deviation - a more precise method.
33
New cards
*Z* scores
Have a mean of zero and a *SD* of one.
__Useful__ for the purposes of rendering inherently meaningless raw scores into easily __interpretable relative scores__.
34
New cards
T scores
Involves rescaling z scores so that the converted scores have a different mean and standard deviation.
Mean is __always 50__ and the standard deviation is __always 10__.
Used in MMPI.
1. Convert the raw scores into *z*-scores.
2. Convert the *z*-scores using the following formula: T = z(10) + 50
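The two steps above can be sketched in Python (the raw scores are made up for illustration):

```python
import numpy as np

raw = np.array([10, 12, 14, 16, 18], dtype=float)

# Step 1: convert raw scores to z-scores (mean 0, SD 1)
z = (raw - raw.mean()) / raw.std(ddof=0)

# Step 2: rescale to T-scores: T = z*10 + 50
t = z * 10 + 50

print(round(t.mean()), round(t.std(ddof=0)))  # 50 10
```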
35
New cards
Percentile ranks
Indicate the __percentage__ of scores that are __equal to__ or __below a specific test score__.
Assumes that the distribution of raw scores is perfectly normal.
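A minimal sketch of the "equal to or below" definition (the score set and helper name are illustrative):

```python
import numpy as np

scores = np.array([55, 60, 60, 70, 75, 80, 85, 90, 95, 100])

def percentile_rank(x, score):
    # Percentage of scores less than or equal to the given score
    return 100 * np.mean(x <= score)

print(percentile_rank(scores, 75))  # 50.0 -> half the sample scored 75 or below
```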
36
New cards
Probability samples
Use procedures that ensure a representative sample.
37
New cards
Random sampling
Type of probability sample, as you would expect a random sample from the population to be representative of the population.
Rarely happens in practice.
38
New cards
Closed-ended
Questions ask respondents to choose from a fixed set of alternatives. For example, yes/no questions.
39
New cards
Open-ended questions
Do not provide respondents with any response alternatives. Given space to respond in their own words.
This may take a narrative form (i.e., long-answer) or short-answer
40
New cards
Rating Scales
The most widely used response format such as the Likert scale. Likert scales use words to ‘anchor’ the numerical ratings. Most surveys use between 4 and 7 points.
Respondents are not capable of making the distinctions between adjacent points in a rating scale with 11 or more points.
6-7 points is best → 5 points is the bare minimum for a continuously measured variable.
41
New cards
Reliability
Pertinent to the __consistency__ of measurement.
The degree of correspondence between __observed scores__ and __true scores__. Theoretically, the correlation between observed scores and true scores.
A __necessary but not sufficient condition for validity__.
The __discrepancy__ between observed scores and true scores is considered to be due to __measurement error__.
The opposite of measurement error.
A property of test scores (from samples), not tests, per se.
42
New cards
Classical Test Theory
A measurement theory that __defines the conceptual basis of reliability__.
Specifies procedures for estimating the reliability of scores derived from a psychological test or instrument.
43
New cards
True scores
A __hypothetical score__ devoid of measurement error.
Conceived in the context of a particular test or instrument – not a construct. Not “construct scores”.
May be “perfect”, but this is __only true in the context of measurement error__ associated with data derived from a particular test or instrument. We assume true scores are devoid of measurement error.
44
New cards
Observed scores
The scores we obtain __from tests__ or instruments.
We want the ______ scores to be as close to their corresponding true scores as possible.
Want a __large (positive) correlation__ between observed scores and true scores.
Want observed scores and error scores to be __uncorrelated__.
If observed scores and error scores are correlated highly, it means they are measuring the same process: error.
45
New cards
An observed score is
True score + measurement error.
46
New cards
Error scores
Should have a __mean of zero__ - there should be just as many people whose observed score is too large as too small (and the magnitudes should be the same).
Error scores should be a __random process__ - should not correlate with anything (except possibly their corresponding observed scores).
Error scores should be __uncorrelated with true scores__ - “extraneous” error-related factors should affect people equally.
47
New cards
1. Reliability is the ratio of true score variance to observed variance
The ratio of SS Effect to SS Total
Conceptually, in the reliability case, it is the ratio of SS True to SS Observed
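This ratio can be demonstrated with simulated data (a sketch under assumed SDs of 15 for true scores and 5 for error; none of these numbers come from the source):

```python
import numpy as np

rng = np.random.default_rng(1)
true = rng.normal(loc=100, scale=15, size=100_000)   # hypothetical true scores
error = rng.normal(loc=0, scale=5, size=100_000)     # random measurement error
observed = true + error

# Reliability = true-score variance / observed-score variance
reliability = true.var() / observed.var()

# Theoretical value here: 15**2 / (15**2 + 5**2) = 225/250 = 0.90
print(round(reliability, 2))
```

In real data the true scores are unknowable, so this ratio can only ever be estimated, never computed directly.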
48
New cards
2. Reliability is the (squared) correlation between observed scores and true scores
When you __square the reliability index__, you get a conceptual estimation of reliability.
Conceptual - it is not possible to know what the true scores are, so we’ll never be able to estimate the actual correlation.
49
New cards
3. Reliability is the lack of error variance
Subtract the ratio of error variance to observed variance from 1 to place it in the same context as reliability (rather than error).
50
New cards
4. Reliability is the lack of correlation between observed scores and error scores.
If reliability is the correlation between true scores and observed scores, then it is necessarily the case that it is the absence of a correlation between observed scores and error scores.
51
New cards
Parallel tests
Two tests are considered ____ if they are __identical__ to each other __psychometrically__, but __differ__ in the actual __items__ that make up each test.
Tau equivalence - a person’s true score on one test is expected to be __identical__ to their true score on the other test. Parallel tests additionally assume equal error variances between the two tests.
52
New cards
Reliability guidelines
According to Nunnally.
Exploratory Research = .70 or higher
Basic Research = .80 or higher
Applied Cases = .90 or even .95 or higher
53
New cards
Validity
Represents the degree to which the observed scores from a test __actually represent__ the attribute (or construct) of interest.
54
New cards
Alternative forms reliability
Same as parallel forms.
Carry-over effects are a concern - one would have to wipe the memory of the test taker to eliminate them entirely.
55
New cards
Test-retest reliability
Create only __one test__, but administer it on __two different occasions__.
The same items are presented on both occasions, so the true scores should represent the same construct. Assume the construct of interest to be a stable one.
All other assumptions associated with parallel forms apply.
56
New cards
Equal error variances
If you had people sit the test under the same conditions (e.g., same room, same time of day), you might be able to assume that responses are affected by error to the same degree.
57
New cards
Test-retest interval
All other things equal, the __magnitude of the interval__ between the two testing sessions will affect the magnitude of the correlation between the scores.
The resulting correlation is also known as the ‘__stability coefficient__’.
58
New cards
Internal-Consistency Reliability
A __practical alternative__ to the alternative forms procedure and the test-retest procedure. Respondents complete __one form of the test__ → different items = different forms.
Affected by:
1. The __length__ of the test - a longer test (more items) will yield more reliable scores than a shorter test.
2. The degree of __consistency__ between the parts/items in the test.
59
New cards
Spearman-Brown formula
Makes an __adjustment__ to the internal consistency reliability estimate to reflect the actual length of the test.
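The adjustment is the Spearman-Brown prophecy formula, r' = kr / (1 + (k − 1)r), where k is the factor by which the test is lengthened. A sketch (the example values are illustrative):

```python
def spearman_brown(r, k):
    """Projected reliability when test length is multiplied by factor k,
    given current reliability r (Spearman-Brown prophecy formula)."""
    return (k * r) / (1 + (k - 1) * r)

# A split-half correlation of .60 adjusted up to the full (double) test length:
print(round(spearman_brown(0.60, 2), 2))  # 0.75
```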
60
New cards
Split-half issues
You get a different reliability estimate depending on how you specify the two halves of the test.
Very unlikely to __get the same correlation__ as in the first split-half example.
61
New cards
Cronbach’s alpha
A reliability formula that represents the reliability of __all possible split-halves__.
The ratio of true score variance to total variance, which is how reliability has been defined at the theoretical level.
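A sketch of the standard computation, α = (k/(k−1))(1 − Σ item variances / total-score variance); the function name and the 5×3 toy data set are illustrative, not from the source:

```python
import numpy as np

def cronbach_alpha(items):
    """items is an (n_respondents, k_items) array of item scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()       # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)         # variance of the sum score
    return (k / (k - 1)) * (1 - item_vars / total_var)

data = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [1, 2, 1], [3, 4, 3]]
print(round(cronbach_alpha(data), 2))  # 0.99 - highly consistent toy items
```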
62
New cards
Cronbach’s Alpha Assumptions
1. Indicators (items) are essentially tau-equivalent - each item is an equally strong indicator of the true score (items may differ by a constant).
2. Each item’s error term is uncorrelated with every other item’s error term - for items that are more alike, a positive correlation exists (alpha cannot accommodate this, so it is treated as if it does not exist).
3. Error scores are uncorrelated with the true scores.
4. Items used to generate a composite score measure only one attribute or construct - unidimensionality.
63
New cards
Standardized Coefficient Alpha
Should be applied to scores that have been converted from raw scores to standardized scores.
64
New cards
Kuder-Richardson 20 formula
Introduced to estimate the internal consistency reliability associated with composite scores based on dichotomously scored items.
Cronbach’s alpha formula works just fine on dichotomously scored items.
Unequal true score variance (but all greater than zero), unequal means, unequal error variances.
Least strict.
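A sketch of KR-20 (using population-form variances throughout; the data are an invented 5-respondent, 4-item example):

```python
import numpy as np

def kr20(items):
    """KR-20 for dichotomous (0/1) items:
    (k/(k-1)) * (1 - sum(p*q) / total-score variance)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    p = items.mean(axis=0)                      # proportion scoring 1 per item
    pq = (p * (1 - p)).sum()                    # sum of item variances p*q
    total_var = items.sum(axis=1).var(ddof=0)   # population-form total variance
    return (k / (k - 1)) * (1 - pq / total_var)

data = [[1, 1, 1, 0], [1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 0, 0], [0, 0, 0, 0]]
print(round(kr20(data), 2))  # 0.47
```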
69
New cards
Cronbach’s alpha is a lower-bound estimate
If the assumption of essential tau-equivalence is not satisfied, then Cronbach’s alpha will tend to __underestimate reliability__.
The underestimation can be substantial in smaller scales.
The larger the sample size, the greater the confidence in the estimate.
* α = .70 → sample of 400
* α = .90 → sample of 100
70
New cards
Relationship between number of items and reliability
When the mean inter-item correlation is .30, it can be observed that the level of Cronbach’s alpha rises relatively rapidly from 2 items to about 8 items.
After about 15 items, the increase is much more gradual and arguably not sufficiently beneficial in most cases.
71
New cards
Sample Homogeneity
A __more homogenous__ sample will yield lower reliability estimates than a __heterogeneous__ sample.
This is because greater homogeneity implies less variance. Less variance implies smaller inter-item correlations, all other things equal.
72
New cards
Coefficient omega (*ω*)
A modern approach to estimating internal consistency reliability. Introduced by McDonald (1999)
Sometimes referred to as McDonald’s omega. Unlike coefficient alpha, it does not assume essential tau-equivalence.
A lower-bound estimate of reliability.
73
New cards
Standard error of measurement
Represents the amount of error “around” a __point estimate__ in standard deviation form.
A point-estimate is one’s “best guess” of what a person’s score is on the test.
A __confidence interval__ can be estimated around a point-estimate - reflects a range of values that is often interpreted as a range in which the true score is likely to fall.
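A sketch of the usual formula, SEM = SD × √(1 − reliability), and a 95% interval around a point estimate (the SD of 15, reliability of .91, and observed score of 110 are illustrative values):

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

s = sem(15, 0.91)                      # 15 * sqrt(0.09) = 4.5
lower = 110 - 1.96 * s                 # 95% CI around an observed score of 110
upper = 110 + 1.96 * s
print(round(s, 1), round(lower, 1), round(upper, 1))  # 4.5 101.2 118.8
```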
74
New cards
Is high internal consistency all we want?
While high levels of reliability are desired, they are __not the only consideration__.
Tests with extremely high levels of internal consistency may be __extremely narrow__ in breadth.
75
New cards
Observed score correlation
The correlation you get based on the data you have.
“Compromised” (or __attenuated__) to the degree that there is measurement error in your data, i.e., __less than perfect reliability__.
76
New cards
True score correlation
A hypothetical correlation you can estimate, if you know the reliabilities associated with the scores. Not compromised by measurement error.
77
New cards
R max
The maximum possible correlation between two variables is equal to the square root of the product of their respective reliabilities.
78
New cards
Correction for Attenuation Formula
Estimates what the correlation would have been had the measures provided perfectly reliable scores.
The ratio of the observed correlation to the square root of the product of the reliabilities (i.e., *r*max).
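Both *r*max and the correction can be sketched together (the example reliabilities of .70 and .80 and observed r of .30 are illustrative):

```python
import math

def r_max(rxx, ryy):
    """Maximum possible observed correlation given the two reliabilities."""
    return math.sqrt(rxx * ryy)

def disattenuate(r_obs, rxx, ryy):
    """Correction for attenuation: estimated true-score correlation."""
    return r_obs / r_max(rxx, ryy)

print(round(r_max(0.70, 0.80), 2))              # 0.75
print(round(disattenuate(0.30, 0.70, 0.80), 2)) # 0.4
```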
79
New cards
Validity
The degree to which a test measures __what it is supposed to measure__. The __most important__ issue in psychological measurement.
Or the degree to which evidence and theory support the interpretations of test scores entailed by the proposed uses of a test.
Related to the __proposed uses__ of the scores.
80
New cards
Construct validity
The degree to which test scores can be interpreted as __reflecting__ a particular psychological __construct__.
81
New cards
Test content
Represents the match between the __actual content__ of the test and the __content that should be__ included in the test.
The __description__ of the nature of the construct should __help define__ the appropriate content of the test.
Includes content validity and face validity.
82
New cards
Content Validity
When the items __cover the entire breadth__ of the construct. Items can __not exceed the boundaries__ of the construct.
Constructs such as personality __suffer from a lack of clear boundaries__.
Assessed by experts in field.
83
New cards
Domain sampling theory
If the 110 questions in an exam represent a __random assortment__ of material covered across the entire unit, __the correlation__ between performance on the 110-item exam and a hypothetical 1500-item exam would likely __be very high__.
84
New cards
Standard Error of Estimate
Allows for finding validity of a short form vs a long form of a particular test.
85
New cards
Face Validity
Face validity represents the degree to which the items associated with a measure __appear__ to be related to the construct of interest.
__Not crucial__ from a fundamental psychometric perspective. More of a __practical consideration__.
Assessed by non-experts.
People can also respond in a way that they think is __most advantageous__ for them.
86
New cards
Factorial Validity
The internal structure of a test - its design and what the test actually measures.
1\. The number of dimensions measured by a test.
2\. Whether the items of a test are related to the dimensions of interest.
3\. Whether the dimensions of interest are related to each other.
The ______ (hypothesized structure) of a test can be tested using various __quantitative techniques__, such as factor analysis.
Determining whether the scores of a measure correspond to the __number and nature of dimensions__ of theorized dimensions that underlie the construct of interest.
Can determine which items “load” onto a dimension of interest.
87
New cards
Response Processes
Should be a __close match__ between the psychological processes that the respondents __actually use__ when completing a measure and the process that they __should use__.
Cheating is an example behaviour that seriously compromises a researcher’s capacity to interpret the scores as valid indicators of performance.
88
New cards
Convergent validity/evidence
The degree to which test scores __are correlated__ with tests of __similar__ constructs. Should be a __positive correlation__ between your self-reported scores and the rater-reported scores (convergent validity).
89
New cards
Discriminant validity/evidence
The degree to which test scores are __uncorrelated__ with tests of unrelated constructs.
It helps to know what a construct __is not__ in the process of its validation. Constructs should not correlate with everything under the sun.
90
New cards
Concurrent Validity Evidence
Observed when scores from one measure correlate in a theoretically meaningful way with the scores of another measure considered to be the “__gold standard__”.
Does not have to be based on measures administered __precisely at the same time__ - but the time period __should be very close__.
__Not particularly impressive__.
91
New cards
Predictive Validity Evidence
The degree to which test scores are correlated with relevant variables that are measured at a __future point in time__.
__Very impressive__ - relatively rare.
92
New cards
Consequential Validity
__Social/personal consequences__ associated with using a particular test.
93
New cards
Criterion Validity
The observation of an association between a psychometric measure and a relevant outcome variable, such as different groups.
Sometimes referred to as ‘criterion groups validity’.
The __oldest form of validity__.
For example, diagnosed with a clinical disorder versus not diagnosed with a clinical disorder
94
New cards
Induction-Construct Development Interplay
A measure is developed solely from an __inductive perspective__ (essentially through discovery).
A lot of refinement in the model occurs along the way.
E.g., the dictionary (lexical) approach to personality.
95
New cards
Measurement as Theory
The primary objective of validation research is to offer a theoretical explanation of the processes that lead up to the measurement outcome.
Measurement considered a __fundamental theory development__ end in its own right.
Hope/assume that the measures we use yield scores that are valid indicators of the constructs of interest.
Constructs and __well articulated theories__ play a primary role in measurement and in psychology more generally.
96
New cards
Factor analysis
A data analytic technique used to help determine the number and nature of dimensions associated with the scores derived from a test.
1. Helps us clarify the number of factors within a set of items (or indicators).
2. Helps us determine the nature of the associations among the factors.
3. Helps us determine which items are linked to which factor, which facilitates the interpretation of those factors.
97
New cards
Unidimensional test
Consists of items which all measure one, single factor.
Each item is linked to one attribute only.
The degree to which each indicator is linked to the attribute can be estimated with __factor analysis__ or __component analysis__.
98
New cards
Multidimensional test (uncorrelated)
Consists of items which measure two or more dimensions which are unrelated to each other.
__No link__ between the two dimensions (components/factors). You would calculate a __separate score__ for each dimension.
__Unacceptable__ to calculate a __total score__.
99
New cards
Multidimensional test (correlated)
Consists of items which measure two or more dimensions which are correlated with each other (positively or negatively).
100
New cards
Higher-order model.
An overall (global) component is included, instead of a link between the two dimensions.