What is dependability?
Reliability
What is the reliability coefficient?
The index of reliability, expressed as the ratio of true score variance to total (observed) score variance
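As a minimal sketch of that ratio, assuming the classical decomposition of observed variance into true variance plus error variance (the function name and the numbers are illustrative, not from the source):

```python
def reliability_coefficient(true_variance, error_variance):
    """Proportion of observed-score variance that is true-score variance."""
    # Assumption: observed variance = true variance + error variance
    observed_variance = true_variance + error_variance
    return true_variance / observed_variance

# Hypothetical values: 8 units of true variance, 2 units of error variance
r_xx = reliability_coefficient(8.0, 2.0)
print(round(r_xx, 2))
```

With these made-up values, 80% of the score variance would be attributed to true differences between test-takers.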
What is test score variability?
Refers to how spread out test scores are within the group
What is the difference between systematic and random error?
Random error involves unpredictable fluctuations and lowers reliability; systematic error is a constant bias and undermines validity.
How is systematic error a greater threat to validity than to reliability?
Because systematic error is constant, it mimics consistency; however, it consistently deviates from the truth.
What are the sources of error?
Item/content sampling, test administration, test scoring and interpretation
What is item/content sampling?
The process of selecting a limited set of items to represent a broader domain of knowledge.
What is the difference between internal and external reliability?
Internal reliability refers to the consistency within the test itself; external reliability refers to the consistency of scores across differing circumstances, times, or raters.
What are the sources of error in test-retest reliability estimates?
Time-sampling, carryover/practice effects, maturation/reactivity
Which reliability coefficient is entirely free from content-sampling error?
Test-retest; because the content remains constant
What is the difference between carryover and practice effects?
Carryover is when emotional or physical states from the first test persist into the second and either increase or decrease test scores; practice is a type of carryover in which memory or skill gains from the first administration usually increase scores on the second.
How do test sophistication and test wiseness differ from carryover and practice effects?
Test sophistication: Familiarity with a specific test or type of test that can boost scores (knowing the format, structure, or typical items). Test wiseness: General strategies for answering test questions effectively, regardless of content (e.g., eliminating wrong choices, time management). Carryover effects: When performance on one test influences performance on a later test (e.g., memory of items, lingering effects). Practice effects: Improved performance simply from repeated exposure to the test, not because of learning the content but because of familiarity with the task.
What is the difference between parallel and alternate forms?
Parallel forms: statistically equivalent (same means, variances, and error variances). Alternate forms: closely matched but not statistically identical.
How does counterbalancing help with carryover effects?
Counterbalancing helps by neutralizing the order in which conditions are presented so that carryover effects (like fatigue, practice, or boredom) are shared equally across all groups.
What correlation coefficient do we use for parallel/alternate forms reliability?
Pearson r
If test-retest = coefficient of stability, what is the coefficient for parallel/alternate forms?
Coefficient of equivalence
What is the difference between KR20 and KR21 and when do we use them respectively?
KR-20 is used for dichotomous items with varying levels of difficulty.
KR-21 is used for dichotomous items assumed to have the same level of difficulty.
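The two formulas can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the data are hypothetical, the function names are invented here, and population variance is used for the total scores so that it matches the p × q item variances.

```python
from statistics import mean, pvariance

def kr20(responses):
    """KR-20 for rows of 0/1 item scores (one row per test-taker)."""
    k = len(responses[0])
    total_var = pvariance([sum(row) for row in responses])
    # Sum of p * q across items, where p is the proportion passing the item
    sum_pq = 0.0
    for i in range(k):
        p = mean(row[i] for row in responses)
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / total_var)

def kr21(responses):
    """KR-21: same idea, but assumes every item is equally difficult."""
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    m = mean(totals)
    total_var = pvariance(totals)
    return (k / (k - 1)) * (1 - (m * (k - m)) / (k * total_var))

# Hypothetical data: 5 test-takers, 4 items of clearly different difficulty
data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(data), 3), round(kr21(data), 3))
```

Because the item difficulties differ in this example, KR-21 comes out lower than KR-20, which is why KR-21 is reserved for items of similar difficulty.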
What is the difference between Cronbach’s Alpha and McDonald’s Omega?
Cronbach's Alpha: Assumes all items contribute equally to the score (tau-equivalence). McDonald’s Omega: Allows for different item contributions, giving a more flexible measure of consistency.
Define tau-equivalence and explain what happens to Cronbach's Alpha if it is violated.
Tau-equivalence means that every item measures the same true score to the same degree (equal true-score loadings), with differences between items due only to error. If it is violated, items contribute unequally and Cronbach's alpha underestimates the test's actual reliability, acting only as a lower bound.
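For reference, Cronbach's alpha can be computed from item variances and total-score variance. The sketch below assumes the standard formula α = k/(k−1) × (1 − Σ item variances / total variance); the ratings and function name are hypothetical.

```python
from statistics import pvariance

def cronbach_alpha(responses):
    """Alpha for rows of item scores (one row per test-taker)."""
    k = len(responses[0])
    # Variance of each item (column) across test-takers
    item_vars = [pvariance([row[i] for row in responses]) for i in range(k)]
    # Variance of the total scores
    total_var = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical Likert-type data: 4 test-takers, 3 items
ratings = [
    [2, 3, 3],
    [4, 4, 5],
    [3, 2, 4],
    [5, 5, 4],
]
alpha = cronbach_alpha(ratings)
print(round(alpha, 2))
```

The same variance type (population or sample) has to be used for both the items and the totals; the ratio is unaffected as long as the choice is consistent.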
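What is Average Proportional Difference (also called average proportional distance, APD)?
A measure of internal consistency based on the degree of difference between item scores. The average difference between item scores is expressed as a proportion of the scale range, so differences can be compared in a standardized way.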
What is the difference between Spearman Brown and Rulon’s Formula?
The Spearman-Brown formula is typically used to estimate reliability when the two halves of a test have about equal variance. The Flanagan-Rulon formula is used when the two halves of the test have unequal variances. It adjusts for differences in standard deviations, making it more suitable when the halves are not equally distributed
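A minimal sketch of both split-half estimates, assuming the usual forms: Spearman-Brown steps up the half-test correlation as 2r / (1 + r), and Rulon's formula is 1 minus the variance of the half-score differences over the variance of the total scores. Scores and function names are hypothetical.

```python
from math import sqrt
from statistics import mean, pvariance

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cross = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)

def spearman_brown(r_half):
    """Estimate full-test reliability from the half-test correlation."""
    return 2 * r_half / (1 + r_half)

def rulon(half_a, half_b):
    """Rulon's split-half estimate; does not assume equal half variances."""
    diffs = [a - b for a, b in zip(half_a, half_b)]
    totals = [a + b for a, b in zip(half_a, half_b)]
    return 1 - pvariance(diffs) / pvariance(totals)

# Hypothetical half-test scores for 5 test-takers
odd_half = [5, 7, 6, 9, 8]
even_half = [4, 8, 6, 8, 9]
print(round(spearman_brown(pearson_r(odd_half, even_half)), 3))
print(round(rulon(odd_half, even_half), 3))
```

In this example the two halves have different variances, so the two estimates diverge slightly.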
When do we use Kappa statistics and Kendall's W?
Kappa is for agreement on nominal (categorical) data; Kendall's W is for agreement on ranked (ordinal) data
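As an illustration of the ordinal case, Kendall's W can be sketched as below, assuming complete rankings with no ties and the formula W = 12S / (m²(n³ − n)), where m is the number of raters, n the number of ranked objects, and S the sum of squared deviations of the rank sums from their mean. The rankings and the function name are hypothetical.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance for untied rankings.

    rankings: one list per rater, giving the rank assigned to each object.
    """
    m = len(rankings)       # number of raters
    n = len(rankings[0])    # number of objects being ranked
    rank_sums = [sum(rater[i] for rater in rankings) for i in range(n)]
    mean_sum = sum(rank_sums) / n
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Hypothetical data: 3 raters each rank 4 essays from 1 (best) to 4 (worst)
rankings = [
    [1, 2, 3, 4],
    [1, 3, 2, 4],
    [2, 1, 3, 4],
]
print(round(kendalls_w(rankings), 3))
```

W ranges from 0 (no agreement among raters) to 1 (identical rankings).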
When do we use Fleiss and Cohen’s Kappa?
Cohen's is used for 2 raters; Fleiss is used for 3 or more raters
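For the two-rater case, Cohen's kappa corrects observed agreement for the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below uses hypothetical ratings and an invented function name.

```python
from collections import Counter

def cohens_kappa(rater_1, rater_2):
    """Cohen's kappa for two raters assigning nominal categories."""
    n = len(rater_1)
    # Observed agreement: proportion of cases where both raters match
    p_o = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    # Chance agreement: product of each rater's marginal proportions
    counts_1, counts_2 = Counter(rater_1), Counter(rater_2)
    categories = set(counts_1) | set(counts_2)
    p_e = sum((counts_1[c] / n) * (counts_2[c] / n) for c in categories)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical data: two raters classify 10 cases as "yes" or "no"
rater_1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "yes", "yes"]
rater_2 = ["yes", "no", "no", "no", "yes", "no", "yes", "yes", "yes", "yes"]
print(round(cohens_kappa(rater_1, rater_2), 3))
```

Here the raters agree on 8 of 10 cases, but kappa is noticeably lower than 0.80 because some of that agreement would be expected by chance.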
What is the difference between criterion- and norm-referenced test?
Norm-referenced test involves comparing an individual's scores against other people's performance; criterion-referenced test involves comparing test scores against a criterion or predetermined standard.
Differentiate CTT, Domain Sampling, Generalizability, and IRT.
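CTT states that observed scores comprise true scores plus error.
Domain Sampling states that adding items increases reliability by better representing the content domain.
Generalizability theory extends CTT by partitioning error into multiple identifiable sources (facets) such as items, raters, and occasions.
IRT improves on CTT by providing sample-independent item statistics, at the cost of much larger sample sizes.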
What is the concept of sample-dependency all about?
In CTT, if the sample has high ability, the items appear easy; conversely, if the sample has lower ability, the items appear difficult. Essentially, item statistics are contingent on the specific sample taking the test rather than the item itself.
How does Domain Sampling relate to CTT?
Domain Sampling explains why CTT rules work: A bigger scoop (more questions) gives a more accurate sample of your knowledge, which naturally cuts out error and boosts reliability.
What is another term for IRT?
Latent-trait theory
Why is IRT better than CTT?
Because IRT is sample-independent, an item's parameters (such as difficulty and discrimination) remain invariant regardless of the ability level of the sample taking the test.
If test-retest = coefficient of stability, what are the coefficients for interrater, split-half, and inter-item reliability?
Test-retest reliability → Coefficient of Stability (consistency of scores over time). Interrater reliability and parallel/alternate forms reliability → Coefficient of Equivalence (agreement between different raters/observers or between forms). Split-half reliability → Coefficient of Internal Consistency (consistency between two halves of a test). Inter-item reliability → Coefficient of Homogeneity (consistency among items within the same test).
What is the equivalent of ±1 and ±2 SEM in confidence intervals?
±1 SEM ≈ 68% confidence interval → The true score is likely within 1 SEM above or below the observed score. ±2 SEM ≈ 95% confidence interval → The true score is very likely within 2 SEMs above or below the observed score.
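The bands above can be sketched numerically, assuming the standard formula SEM = SD × √(1 − reliability). The SD, reliability, and observed score below are hypothetical, as are the function names.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM from the test's standard deviation and reliability coefficient."""
    return sd * math.sqrt(1 - reliability)

def confidence_band(observed, sem, n_sems):
    """Band of n_sems SEMs on either side of the observed score."""
    return (observed - n_sems * sem, observed + n_sems * sem)

# Hypothetical test: SD = 15, reliability = 0.91, observed score = 100
sem = standard_error_of_measurement(sd=15, reliability=0.91)
band_68 = confidence_band(100, sem, 1)   # roughly 68% band
band_95 = confidence_band(100, sem, 2)   # roughly 95% band
print(round(sem, 2), [round(v, 1) for v in band_68], [round(v, 1) for v in band_95])
```

The more reliable the test, the smaller the SEM and the narrower both bands become.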
How do we differentiate SEM, SED, SEE?
SEM (Standard Error of Measurement): Focuses on a single score; shows how much error is in one observed score. Linked to reliability. SED (Standard Error of Difference): Compares two scores; indicates whether the gap between them is real or significant. SEE (Standard Error of Estimate): Evaluates prediction accuracy; shows the average amount of error in predicted scores. Linked to validity.
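A short sketch of the other two statistics, assuming the commonly taught formulas SED = SD × √(2 − r₁ − r₂) for two scores on the same scale and SEE = SD_y × √(1 − r_xy²). All values and function names are hypothetical.

```python
import math

def standard_error_of_difference(sd, reliability_1, reliability_2):
    """SED for two scores on the same scale (same SD)."""
    return sd * math.sqrt(2 - reliability_1 - reliability_2)

def standard_error_of_estimate(sd_criterion, validity_coefficient):
    """SEE: typical error when predicting the criterion from the test."""
    return sd_criterion * math.sqrt(1 - validity_coefficient ** 2)

# Hypothetical: two tests with SD = 15 and reliabilities 0.91 and 0.84
sed = standard_error_of_difference(15, 0.91, 0.84)
# Hypothetical: criterion SD = 10, validity coefficient = 0.60
see = standard_error_of_estimate(10, 0.60)
print(round(sed, 2), round(see, 2))
```

Note that SED depends on reliability coefficients while SEE depends on a validity coefficient, which mirrors the reliability/validity links in the card.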
How do we differentiate reliability and validity?
Reliability: Consistency of measurement. A test is reliable if it produces stable, repeatable results (like a scale giving the same weight each time).
Validity: Accuracy of measurement. A test is valid if it actually measures what it claims to measure (like a scale measuring weight, not height).
What is the relationship between reliability and validity?
A test can be reliable but not valid (e.g., consistently wrong), but it cannot be valid unless it is reliable.
What is the concept that refers to a judgment regarding how well the test measures what it purports to measure at the time and place the variable is naturally being emitted?
Ecological validity
Differentiate external validity and ecological validity.
External Validity: Who, Where, and When (Generalizability). Ecological Validity: The Setting (Naturalness vs. Artificiality)
What are the components of ecological validity?
Verisimilitude → Appearance: The degree of resemblance between the test situation and real-world conditions. "Looks like real life". Veridicality → Accuracy: The degree to which test scores truly reflect or predict actual functioning in the real world. "Works like real life"
How can we increase internal and external validity?
Internal validity: use random assignment, standardization, and counterbalancing to control extraneous variables; external validity: use diverse yet intentional participants as well as a naturalistic setting
How do external, internal, conceptual, and face validity differ from each other?
Internal: confidence that the observed relationship is real; external: generalizability of the results. Conceptual: rests on a theoretical foundation; face: how valid the test appears to be, as judged by the test-taker
What is the difference between conceptual and construct validity?
Conceptual is the blueprint on which the test is based. Construct ensures that the construct is being measured accurately.
Differentiate the trinitarian view of validity.
Content: whether the test covers all important aspects of the topic. Criterion-related: whether test scores relate to other measures or outcomes. Construct: whether test results align with or contradict the underlying theory.
Define construct underrepresentation and construct-irrelevant variance.
Construct underrepresentation: the test lacks important components of the construct. Construct-irrelevant variance: the test picks up extraneous constructs, which distorts the accuracy of the scores.
What are the core principles of content validity?
Representativeness, relevance, absence of bias, technical quality
How does content validity associate with construct and criterion validity?
Content validity ensures that the test covers all important aspects of the construct, which supports construct validity. Criterion validity then checks that the results correlate with those of another related measure or with a standard.
What do we use to calculate the content validity of each item according to experts? How is each item rated?
The content validity index (CVI); experts rate each item's relevance on a 4-point scale
What does Zero CVR mean? How about a positive one?
A CVR of 0 means exactly half of the raters endorse the item (no consensus); a positive CVR means more than half do, with +1 meaning all raters agree on the item's relevance; -1 means all raters find the item irrelevant
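These values follow from the ratio itself. The sketch below assumes CVR refers to Lawshe's content validity ratio, CVR = (n_e − N/2) / (N/2), where n_e is the number of raters endorsing the item and N is the total number of raters; the panel size is hypothetical.

```python
def content_validity_ratio(n_endorsing, n_raters):
    """Lawshe's CVR (assumed formula): ranges from -1 to +1."""
    half = n_raters / 2
    return (n_endorsing - half) / half

# Hypothetical panel of 10 raters
for endorsing in (0, 5, 8, 10):
    print(endorsing, content_validity_ratio(endorsing, 10))
```

Zero appears only when exactly half of the panel endorses the item, and the sign shows which side of that halfway point the panel falls on.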
What is the difference between I-CVI and S-CVI?
I-CVI: focuses on testing individual item's relevance; S-CVI: focuses on providing the summary of all items' relevance
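A minimal sketch of both indices, assuming the common convention that a rating of 3 or 4 on the 4-point scale counts as "relevant", that I-CVI is the proportion of experts giving such a rating, and that the scale-level index is the average of the I-CVIs (often written S-CVI/Ave). Ratings and function names are hypothetical.

```python
def item_cvi(ratings):
    """I-CVI: proportion of experts rating the item 3 or 4 (relevant)."""
    return sum(1 for r in ratings if r >= 3) / len(ratings)

def scale_cvi_average(items):
    """S-CVI/Ave (assumed variant): mean of the item-level CVIs."""
    cvis = [item_cvi(ratings) for ratings in items]
    return sum(cvis) / len(cvis)

# Hypothetical ratings from 4 experts for 3 items
expert_ratings = [
    [4, 4, 3, 4],   # item 1
    [3, 2, 4, 4],   # item 2
    [2, 1, 3, 2],   # item 3
]
print([item_cvi(r) for r in expert_ratings])
print(round(scale_cvi_average(expert_ratings), 3))
```

In this example item 3 would be flagged for revision at the item level even though the scale-level summary averages it in with the stronger items.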
What can criterion-related validity be used for?
Can be used to infer an individual's most probable standing on some measure of interest.
What is the primary difference between concurrent and predictive validity?
Concurrent validity → The degree to which test scores correlate with another measure taken at the same time. Predictive validity → The degree to which test scores forecast or predict performance on a relevant measure taken in the future.
What does high incremental validity indicate?
Incremental validity: The extent to which a new test or measure adds useful information beyond what existing measures already provide. High incremental validity → Shows that the measure contributes unique, additional predictive value not captured by other tests or assessments.
What are the core principles of construct validity?
Clear conceptualization, operationalization, empirical evidence
What are the six (6) evidences of construct validity and how do they differ from each other?
Evidence of homogeneity → Proof that the test measures only one construct (items are internally consistent).
Evidence of changes with age → The construct is expected to change across development, and scores should follow that expected course.
Evidence of pretest-posttest changes → If an intervention targets the construct, test scores should change accordingly.
Known-groups validity → The test should differentiate between groups known to differ on the construct (e.g., clinical vs. non-clinical).
Convergent validity → The test should correlate with other established measures of the same construct.
Discriminant validity → The test should show little to no correlation with measures of different constructs.
What do we use to evaluate both convergent and divergent validity simultaneously?
Multitrait-multimethod matrix (MTMM)
What else is the primary purpose of the multitrait-multimethod matrix?
For assessing the construct validity of a set of measures in a study
What is the function of each correlation in the MTMM matrix?
Monomethod-monotrait: reliability diagonal; should be the highest in the entire matrix. Heteromethod-monotrait: validity diagonal; provides evidence of convergent validity. Monomethod-heterotrait: should be low as it provides evidence of discriminant validity. Heteromethod-heterotrait: should be the lowest as it further supports the discriminant validity
If the heterotrait-monomethod triangle is high, this means that the test has __?
Low discriminant validity; because it's only capturing the test-taker’s ability with the method (e.g., response style, format familiarity) instead of uniquely measuring the intended construct.
What are latent variables?
Factors that are not directly observed but are inferred from patterns in responses or behaviors.
How does factor loading work?
A statistical value that shows how strongly a specific item (question) is associated with a latent construct (hidden variable); represents the correlation between the item and the latent construct.
Higher loadings, close to __, mean the item is a strong indicator of the construct
1
How does factor analysis function?
A statistical technique used to identify latent variables (factors) by examining patterns of correlations among items or questions.
What are the two types of factor analysis?
Exploratory Factor Analysis (EFA) → Used when we have no preconceived idea of the factor structure. It aims to explore and discover the underlying dimensions. It is theory-generating. Confirmatory Factor Analysis (CFA) → Used when we already have a hypothesized factor structure based on theory or prior research. It aims to test and confirm whether the data fit that structure. It is theory-testing.
How does Factor Analysis differ from Principal Component Analysis?
Factor Analysis (FA) → Aims to identify latent variables (factors) by examining correlations among items; focuses on uncovering hidden structure. Principal Component Analysis (PCA) → Aims to reduce dimensionality by summarizing items into components that maximize explained variance; involves data simplification.
What does variance maximization have to do with PCA?
The more variance a component captures, the more information it retains from the original dataset, making summarization more effective.
PCA forces data into a shape that doesn't overlap as it follows the concept of __?
Orthogonality
Define eigenvalue. Where does it lie on the scree plot?
Eigenvalue → A numerical value that represents the amount of variance explained by a factor or principal component; it is plotted on the y-axis.
Each standardized item contributes variance equivalent to an eigenvalue of __. Additionally, what should we do with factors that have eigenvalues lower than that?
1.0; we discard factors whose eigenvalues are lower than 1.0, because they explain less variance than a single item
Which criterion/rule explains that we should retain factors that have an eigenvalue greater than 1.0?
Kaiser criterion/K1 rule
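Applied to a list of eigenvalues, the rule is a simple filter. The eigenvalues below are hypothetical (six items, so they sum to 6.0, assuming they come from a correlation matrix), and the function name is invented for illustration.

```python
def kaiser_retained(eigenvalues, cutoff=1.0):
    """Keep only factors whose eigenvalue exceeds the cutoff (K1 rule)."""
    return [ev for ev in eigenvalues if ev > cutoff]

# Hypothetical eigenvalues from a 6-item correlation matrix
eigenvalues = [2.8, 1.4, 0.8, 0.5, 0.3, 0.2]
retained = kaiser_retained(eigenvalues)
# Total variance equals the number of standardized items
proportion_explained = sum(retained) / len(eigenvalues)
print(len(retained), round(proportion_explained, 2))
```

Two factors are kept here; the remaining four each explain less variance than a single item would on its own.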
Why is the Kaiser criterion not considered the most accurate method in EFA and PCA?
It tends to overestimate the number of factors, especially when many variables are included.
What exactly is the limitation of the elbow method?
Elbow point is subjective and therefore ambiguous.
What is the process of revalidating a test by using a different group and their scores as the criterion?
Cross-validation
What is the primary goal of cross-validation?
To check whether the test’s validity generalizes beyond the initial sample.
What is the difference between co-norming and co-validation?
Co-validation → Validating two or more tests on the same sample of test-takers, so that their validity evidence (for example, against a shared criterion) can be compared. Co-norming → Administering two or more tests to the same group of test-takers so their scores can be placed on a common normative scale; it aligns score interpretations across tests.
What is the difference between co-validation and predictive validity?
Co-validation: comparing two tests against the same criterion. Predictive validity: evaluating one test’s ability to predict future outcomes.
What is the process of administering a test for the purpose of establishing norms?
Standardization
What is the process of selecting a portion to represent a population?
Sampling
What are the three types of sampling methods?
Stratified sampling → The population is divided into subgroups (strata) based on characteristics (e.g., age, gender), and samples are taken from each subgroup to ensure representation.
Purposive sampling → The researcher deliberately selects participants who fit specific criteria or purpose (e.g., experts in a field).
Incidental sampling → Also called convenience sampling; participants are chosen simply because they are easy to access (e.g., whoever happens to be available).
What is the difference between grade and age norms?
Grade norms: grade-level basis; Age norms: age-equivalent scores
How do national norms differ from national anchor norms?
National norms → Scores are collected from a representative sample of the entire nation. They provide a baseline for interpreting individual test scores relative to the general population. National anchor norms → Created by linking or equating scores from different tests using a common national reference group. They allow scores from different tests to be compared on the same scale.
What do you call the type of norms that are typically developed by test users?
Local norms
What do you call the type of norm that can be further segmented by any of the criteria initially used in selecting subjects for the sample?
Subgroup norms
What is the difference between criterion- and norm-referenced test?
Criterion: compares the scores of the test-taker against a fixed standard. Norm: compares the scores against the scores of other test-takers
If norm-referenced is all about assessing "Where do I stand compared to others?" and criterion-referenced is all about "How much of the material do I know?", how is fixed-referenced any different?
Compares the test-taker's performance against the scores of past test-takers.
If p value is above 0.10, what does this signify?
No evidence against the null hypothesis
If p value is lower than or equivalent to 0.001, what does this signify?
Extremely strong evidence against the null hypothesis
If alpha is more than 0.9, what does this signify?
Excellent internal consistency
If alpha is less than 0.6 but more than 0.50, what does this signify?
Poor internal consistency reliability
At which point do we tag inter-item consistency as questionable?
0.60 < α < 0.70
At which point do we tag validity as very beneficial and depends on the circumstances?
Above 0.35
A test's reliability is interpreted as having limited applicability if the coefficient is __.
Below 0.70
An interrater reliability coefficient ranging from 0.40 to 0.75/0.80 is considered fair to good, according to which benchmark?
Fleiss, 1981
Which inter-rater reliability measure mimics Fleiss' Kappa but incorporates predefined benchmarks for qualitative interpretations like 'Good' and 'Fair'?
Cicchetti and Sparrow
__ reliability coefficient is considered perfect and may indicate redundancy/homogeneity.
0.95 and above
__ reliability coefficient is considered the minimum for clinical settings.
0.9
__ reliability coefficient is considered the minimum for psychometric tests.
0.8
__ reliability coefficient is considered acceptable for research.
0.7