what is reliability?
consistency of measurement
what is a reliability coefficient?
is an index of reliability, a proportion that indicates the ratio between the true score variance on a test and the total variance
what are the components of an observed score?
true score + measurement error (X= T + E)
error
refers to the component of the observed score that does not have to do with the testtaker's true ability or the trait being measured
variance
standard deviation squared
variance equals
true variance plus error variance
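Putting the last few cards together in standard CTT notation (the symbols below are conventional shorthand, not from the original cards): with X the observed score, T the true score, and E the error,
X = T + E
σ²(X) = σ²(T) + σ²(E)
reliability coefficient = σ²(T) / σ²(X), i.e., the proportion of total score variance that is true-score variance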
measurement error
all of the factors associated with the process of measuring some variable, other than the variable being measured.
What is random error?
a source of error in measuring a targeted variable caused by UNPREDICTABLE FLUCTUATIONS and inconsistencies of other variables in the measurement process (i.e., noise)
What is systematic error?
a source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured.
Test Construction
Variation may exist within items on a test or between tests (i.e., item sampling or content sampling).
Test Administration
Sources of error may stem from the testing environment.
Testtaker Variables
pressing emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medication
Examiner-related variables
physical appearance, demeanor, eye contact
Test Scoring and Interpretation
Computer testing reduces errors in test scoring, but many tests still require expert interpretation (e.g., projective tests).
Sampling Error
the extent to which the sample in the study actually is representative of the population
Methodological error
inadequately trained interviewers or administrators, ambiguous wording in the questionnaire, biased framing of questions
Test-retest reliability
a method for determining the reliability of a test by comparing a test taker's scores on the same test taken on separate occasions (same person takes the test twice)
What kind of variable is test-retest reliability appropriate for? What kind is it inappropriate for?
personality (stable); mood (changes)
Coefficient of Stability
With intervals over 6 months, the estimate of test-retest reliability is called the coefficient of stability
coefficient of equivalence
The degree of the relationship between various forms of a test.
parallel forms reliability
for each form of the test, the means and the variances of observed test scores are equal
alternate forms reliability
different versions of a test that have been constructed so as to be parallel.
They do not meet the strict requirements of parallel forms, but item content and difficulty are typically similar between forms
How is alternate-forms (or parallel-forms) reliability checked?
By administering two forms of a test to the same group. Scores may be affected by error related to the state of testtakers (e.g., practice, fatigue, etc.) or item sampling
split-half reliability
is obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once.
3 steps of split half reliability
1. Divide the test into equivalent halves
2. Calculate a Pearson r between scores on the two halves of the test
3. Adjust the half-test reliability using the Spearman-Brown formula
spearman-brown formula
allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test.
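For a half-test correlation, the standard Spearman-Brown correction (r_hh = the correlation between the two halves) is:
r_SB = (2 × r_hh) / (1 + r_hh)
For example, if the two halves correlate at .70, the estimated full-test reliability is (2 × .70) / (1 + .70) ≈ .82.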
inter-item consistency
The degree of relatedness of items within a test. Able to gauge the HOMOGENEITY of a test
Kuder-Richardson formula 20
statistic of choice for determining the inter-item consistency of DICHOTOMOUS ITEMS
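For reference, the usual KR-20 expression (standard formula; k = number of dichotomous items, p_j = proportion passing item j, q_j = 1 − p_j, σ²_total = variance of total scores):
KR-20 = (k / (k − 1)) × (1 − Σ(p_j × q_j) / σ²_total)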
coefficient alpha
mean of all possible split-half correlations, corrected by the Spearman-Brown formula. The most popular approach for internal consistency.
what range of values can coefficient alpha take?
Values range from 0 to 1
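The standard computational form (conventional notation, not specific to this deck; k = number of items, σ²_j = variance of item j, σ²_total = variance of total scores):
α = (k / (k − 1)) × (1 − Σσ²_j / σ²_total)
Values closer to 1 indicate greater internal consistency.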
inter-scorer reliability
The degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure
-It is often used with behavioral measures
-Guards against biases or idiosyncrasies in scoring
coefficient of inter-scorer reliability
The scores from different raters are correlated with one another.
the more __________ (i.e., the more similar to one another the items you are creating are), THE HIGHER THE RELIABILITY IS
homogeneous
How quickly the construct changes (how static or stable it is) will affect the ….
TEST-RETEST RELIABILITY
true score
a value that according to classical test theory genuinely reflects an individual’s ability (or trait) level as measured by a particular test.
true-score model is often referred to as
Classical Test Theory (CTT)—Perhaps the most widely used model due to its simplicity.
Generalizability theory
a person’s test scores may vary because of variables in the testing situation and sampling.
item-response theory
Provides a way to model the probability that a person with X ability will be able to perform at a level of Y.
What is standard error of measurement?
often abbreviated as SEM, provides a measure of the PRECISION of an OBSERVED TEST SCORE. An estimate of the amount of error inherent in an observed score or measurement.
do we want small or large standard error of measurement?
Small
the higher the reliability of the test…
the lower the standard error
The standard error can be used to estimate the extent to which an observed score
DEVIATES from a true score
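Both of these points follow from the usual SEM formula (standard formula; SD = the test's standard deviation, r_xx = its reliability coefficient):
SEM = SD × √(1 − r_xx)
As r_xx approaches 1, the SEM shrinks toward 0, so a more reliable test yields a more precise observed score.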
confidence interval
a range or band of test scores that is likely to contain the true score
Confidence level of 95%
a z-value of 2 (more precisely, 1.96)
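The general form used in the sample problems later in this deck (X = observed score, z = the chosen confidence-level z-value, SEM as defined above):
CI = X ± (z × SEM)
At the 95% level the deck rounds z to 2, so the band is the observed score plus or minus twice the SEM.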
What are some cultural considerations in test construction/standardization?
Check how appropriate the test (its items and norms) is for use with the targeted testtaker population
When interpreting test results it helps to know about the culture and era of the test-taker
It is important to conduct a culturally-informed assessment
Describe the difference between random error and systematic error
Random error is unpredictable and varies each time, causing results to go up or down by chance.
Systematic error is consistent and always skews results in the same direction.
Random error affects consistency, while systematic error affects accuracy.
When is it appropriate to use test-retest reliability?
Use when you want to assess the stability of a test over time. This index is most suitable when the characteristic being measured is expected to remain stable (e.g., intelligence, personality traits).
Example: A researcher administering the same personality test to the same group of people at two different times to see if the scores remain consistent.
When is it appropriate to use parallel or alternate forms reliability?
Use when you want to check the consistency between two different versions of the same test. It’s ideal when repeated testing could lead to practice effects (people getting better just because they remember the questions).
Example: Creating two different versions of a math test to ensure that both versions measure math skills equally well.
When is it appropriate to use split-half reliability?
Use when you want to measure the internal consistency of a test. The test is split into two halves, and the scores are compared. It’s suitable for tests where all items are meant to measure the same underlying construct.
Example: A long questionnaire is divided into odd and even items, and the correlation between these halves is assessed.
When is it appropriate to use inter-item consistency?
Use when you want to determine the degree to which items on a test measure the same construct. It’s suitable for tests or questionnaires with multiple items aiming to assess the same concept.
Example: Checking if all questions in a depression inventory are related to the same concept of depression.
When is it appropriate to use coefficient alpha (cronbach’s Alpha)?
Use when you want a measure of overall internal consistency for a test. This is suitable when you have a test or survey with multiple items, all aimed at measuring the same construct, and you need a single reliability estimate.
Example: Assessing the reliability of a 20-item scale that measures self-esteem.
When is it appropriate to use inter-scorer reliability?
Use when you want to determine the consistency of scores assigned by different raters or judges. This is crucial when the scoring involves subjective judgment.
Example: Multiple judges rating the creativity of children's artwork and checking the agreement among their scores.
How does homogeneity vs heterogeneity of test items impact reliability?
homogeneous items increase reliability because they reflect the same construct, leading to higher internal consistency. In contrast, heterogeneous items decrease reliability if measured by indices assuming unidimensionality, but they might still provide a comprehensive assessment of a broader concept.
What is the relation between the range of test scores and reliability?
a broader range of scores usually leads to higher reliability, while a restricted range reduces reliability estimates, potentially underestimating the test's true ability to measure what it’s supposed to.
What is the impact of a speed test or power test on reliability?
power tests (items get progressively harder, with a generous time limit) and speed tests (complete as many generally easy items as you can within a time limit) complicate reliability estimation: internal-consistency estimates computed from a single timed administration of a speed test come out spuriously high
therefore, single-administration reliability coefficients for such tests are not entirely useful and should not be leaned on; a speed test's reliability should instead be based on two independent testings (e.g., test-retest, alternate forms, or two separately timed half-tests)
Assumptions of Classical Test Theory (CTT)
people will have a true score on the construct
there will be error in every observed score (every measurement of the true score contains error)
errors are not correlated with the true score (they are random)
errors are normally distributed
errors cancel themselves out
the greater the number of items, the greater the reliability
Pros and Cons of Classical Test Theory (CTT)
Pros:
the assumptions are easily met
it is simple and easier to understand
most psychological tests are developed using this
Cons:
assumes the items on a test are all equal in their ability to measure the construct
tends to require long measures, since reliability under CTT is increased mainly by adding more items
Assumptions of Item Response Theory (IRT)
Unidimensionality: assumes that a single latent trait or ability (e.g., math ability or reading comprehension) primarily drives responses to all items on a test.
Local Independence: if you know a person’s ability level, knowing their response to one item shouldn’t give additional information about their response to another.
Monotonicity: For each item, the probability of a correct response increases as the latent trait level increases. This means that higher levels of the trait should lead to a higher likelihood of getting the item right (or agreeing with it in the case of attitude measures).
Invariance of Item Parameters: Item parameters (like difficulty and discrimination) are assumed to be the same across different groups, meaning that items function similarly regardless of the specific sample of test-takers.
What is difficulty and discrimination in IRT?
Difficulty determines where on the ability scale the item is targeted.
Discrimination determines how sharply the item distinguishes between different ability levels.
A well-designed IRT-based test aims to have items with a range of difficulties and high discrimination to accurately measure individuals across the ability spectrum.
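One common way these two parameters are combined is the two-parameter logistic (2PL) IRT model (a standard model used here for illustration; your course may use a different one). The probability that a person with ability θ answers item i correctly is:
P_i(θ) = 1 / (1 + e^(−a_i(θ − b_i)))
where b_i is the item's difficulty (where on the ability scale the curve is centered) and a_i is its discrimination (how steeply the probability rises around b_i).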
SAMPLE PROBLEM: Calculate the confidence interval given a standard error of measurement of 4, a 95% confidence level (z-score of 2), and an observed score of 78.
Observed score= 78, SEM= 4, z-score for 95%= 2:
78 ± (2×4) so 78 ± 8
The confidence interval is 70 to 86
SAMPLE PROBLEM #2 for CONFIDENCE INTERVAL: A standardized test has a standard error of measurement (SEM) of 3 points. If a student's observed score is 85, calculate the 95% confidence interval for the student's true score. Use a z-score of 2, which corresponds to a 95% confidence level.
Observed score= 85, SEM= 3, z-score for 95% confidence=2
85 ± (2×3) so 85 ± 6 which means the confidence interval is 79 to 91 so the student’s true score lies within that range
How to use confidence interval information to interpret test scores?
Understanding Accuracy: A narrow confidence interval indicates a more precise estimate of the true score, while a wide interval suggests less precision.
Comparing Scores: When comparing two students' test scores, overlapping confidence intervals may suggest that the difference in their scores isn't significant, while non-overlapping intervals imply a real difference.
Decision-Making: Confidence intervals can help decide whether a student's score meets a cutoff or threshold (e.g., passing/failing), as it takes into account measurement error. For example, if a passing score is 80 and a student's CI is 78 to 84, there's a chance their true score is both above and below the passing threshold.
What is the standard error of the difference and how does it help us with interpretation of two scores?
A measure that can aid a test user in determining how large a difference in test scores should be expected before it is considered statistically significant.
It helps us interpret two scores because we can see whether there was a real improvement or change (if the two scores are statistically different, the confidence intervals around them should not overlap).
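A standard way to compute it (assuming the two scores come from tests with SEMs as defined earlier; the notation is conventional, not from the original card):
SE_diff = √(SEM₁² + SEM₂²), or equivalently σ × √(2 − r₁ − r₂) when both tests share standard deviation σ and have reliabilities r₁ and r₂.
A difference of roughly 2 × SE_diff or more is then treated as significant at about the 95% level.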
Three questions that standard error of difference can help us answer:
1. How did this individual’s performance on test 1 compare to their own performance on test 2?
2. How did this individual’s performance on test 1 compare with someone else’s performance on test 1?
3. How did this individual’s performance on test 1 compare with someone else’s performance on test 2?
What is validity?
a judgment or estimate of how well a test measures what it is supposed to measure within a particular context
Relationship between reliability and validity
Reliability is about consistency, while validity is about accuracy. You can have a test that is reliable but not valid, but you can't have a valid test that isn't reliable.
Content validity
Evaluation of the subjects, topics, or content covered by the items in the test
Typically established by recruiting a ___team of experts_________ on the subject matter and obtaining ___expert ratings_______ on the degree of item importance as well as scrutinizing what is missing from the measure
Criterion-related validity
Evaluating the relationship of scores obtained on the test to scores on other tests or measures
A criterion is the standard against which a test or a test score is evaluated
Construct validity
This is a measure of validity that is arrived at by executing a comprehensive analysis of:
how scores on the test relate to other test scores and measures
how scores can be interpreted within a theoretical framework that explains the construct the test was designed to measure
“Pyramid of validity” order
Construct validity
Criterion-related validity
Content validity
Face validity
a judgment concerning how relevant the test items appear to be
If a test appears to measure what it is supposed to be measuring “on the face of it,” it could be said to be high in face validity
A perceived lack of face validity may contribute to a lack of confidence in the test
Content validity
How well a test samples behaviors that are representative of the broader set of behaviors it was designed to measure
Do the test items adequately represent the content that should be included in the test?
Characteristics of a Criterion
An adequate criterion is relevant for the matter at hand, valid for the purpose for which it is being used, and uncontaminated, meaning it is not part of the predictor.
What constitutes good face validity?
Appearance of Relevance: the items look like they are clearly related to the construct or skill being assessed.
For example, a math test should have questions that look mathematical (formulas, calculations) rather than unrelated content.
Clear Instructions and Questions: The questions are straightforward and understandable, with no ambiguity or confusion about what is being asked.
Transparency: The purpose of the test should be evident to the test-taker, which can make them more motivated and cooperative during the test.
Consequences of lacking face validity
Lack of Trust: test-takers may not trust the test or might question its relevance. This can lead to a lack of motivation or cooperation.
For example, if a personality test contains seemingly random or unrelated questions, respondents may doubt the test's validity.
Decreased Motivation: make individuals less engaged, potentially affecting how seriously they take the test, which could negatively impact the results.
Perceived Unfairness: Test-takers might feel that they are being judged based on irrelevant criteria, leading to feelings of unfairness and dissatisfaction.
Why would we not want a test to be face valid?
Reducing Social Desirability Bias: If a test is too face valid, test-takers might alter their responses to look better or to meet perceived expectations. This is common in personality or psychological tests, where people might respond in a socially desirable way if they understand exactly what’s being measured.
For example, if a test clearly appears to measure honesty, individuals might choose answers they think make them look honest rather than providing true responses.
Minimizing Faking: In some scenarios, especially in employment testing, we might not want candidates to know exactly what is being assessed to prevent them from faking good answers.
A job skills test with high face validity might lead candidates to prepare in ways that don't truly reflect their skills, affecting the test’s effectiveness.
Testing Hidden Constructs: In psychological assessments, certain constructs may be better assessed when individuals are unaware of what is being measured. This helps in capturing more genuine behavior or responses.
For instance, subtle tests of cognitive biases or implicit attitudes are less face valid by design to ensure more accurate responses.
concurrent validity
an index of the degree to which a test score is related to some criterion measure obtained at the same time
predictive validity
an index of the degree to which a test score predicts some criterion or outcome measured in the future. Tests are evaluated as to their ______.
What is the difference between concurrent and predictive validity?
Concurrent validity tells you how well a test aligns with a current standard or established measure.
Predictive validity tells you how well a test predicts future outcomes.
Both forms of validity help in determining how useful a test is for different purposes, but they serve distinct roles based on the timing of the criterion being measured.
base rate (predictive validity)
the extent to which the phenomenon exists in the population
Example: 55% divorce base rate is a high base rate
hit rate (predictive validity)
accurate identification (True-positive & True-negative)
miss rate (predictive validity)
Failure to identify accurately
False-positive
False-negative
false positive
When a test incorrectly indicates the presence of a condition or attribute when it is actually absent. In other words, the test result is positive, but the true condition is negative.
false negative
When a test fails to detect a condition or attribute that is actually present. Here, the test result is negative, but the true condition is positive.
Type 1 error
AKA false positive. The null hypothesis (H₀) is rejected when it is actually true. The incorrect conclusion that there is an effect or difference when none exists.
Example: Positive COVID test but a person DOES NOT have COVID
Type 2 error
AKA False negative. When the null hypothesis (H₀) is not rejected when it is actually false. This means that a true effect or difference is missed, leading to the conclusion that there is no effect when there actually is one.
Example: Negative COVID test, but the person DOES have COVID
Validity coefficient
a correlation coefficient between test scores and scores on the criterion measure
Validity coefficients are affected by restriction or inflation of range.
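Concretely, the validity coefficient is just a Pearson correlation computed between test scores and criterion scores (standard formula; X = test scores, Y = criterion scores):
r_XY = cov(X, Y) / (SD_X × SD_Y)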
Incremental validity
the degree to which an additional predictor explains something about the criterion measure that is not explained by predictors already in use
What happens to the validity coefficient when you restrict or inflate the range of scores?
Restricted Range: Lowers the validity coefficient, making the test seem less predictive than it might be with a full range of scores.
Inflated Range: Can increase the validity coefficient, possibly making the test seem more predictive if the range is artificially broadened.
What is the importance of incremental validity
Increases prediction accuracy by identifying unique contributions.
Prevents unnecessary redundancy in measurements.
Helps allocate resources efficiently.
Contributes to a more comprehensive understanding of constructs.
Ensures that new measures genuinely enhance validity.
If a test has high construct validity, what does this tell you about the test?
The test is accurately and consistently measuring the intended concept or trait, aligns with theoretical expectations, and appropriately correlates with related measures while distinguishing itself from unrelated ones. It assures that the test's scores are meaningful indicators of the construct in question.
Evidence of homogeneity (Construct V)
how uniform a test is in measuring a single concept
Evidence of changes (Construct V)
Some constructs are expected to change over time (e.g., reading rate)
How stable is it? Be able to show it with your measure
Evidence of pretest/post test changes (Construct V)
test scores change as a result of some experience between a pretest and a posttest (e.g., therapy)
Dynamic assessment: not just stability; when the environment is manipulated (e.g., through an intervention), you would expect changes.
Evidence from distinct groups (Construct V)
Scores on a test vary in a predictable way as a function of membership in some group (e.g., scores on the Psychopathy Checklist for prisoners vs. civilians).
Convergent evidence (Construct V)
Correlates highly in the predicted direction with scores on previously psychometrically established tests designed to measure the same (or similar) constructs
Discriminant evidence (construct v)
Showing little relationship between test scores and other variables with which scores on the test should not theoretically be correlated
Factor Analysis (Construct V)
A new test should load on a common factor with other tests of the same construct.
What is bias?
a factor inherent in a test that systematically prevents accurate, impartial measurement
What is fairness?
The extent to which a test is used in an impartial, just, and equitable way.