Psychometric Properties and Principles

Last updated 6:04 AM on 5/28/26

96 Terms

New cards

What is dependability?

Reliability

New cards

What is the reliability coefficient?

An index of reliability: the proportion of observed (total) score variance that is attributable to true score variance.
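
As a minimal sketch of this proportion, assuming the classical test theory decomposition in which observed variance is true variance plus error variance (the function name and the numbers are illustrative, not from the source):

```python
def reliability_coefficient(true_variance: float, error_variance: float) -> float:
    """Proportion of observed-score variance that is true-score variance."""
    # Assumption: true scores and errors are uncorrelated, so variances add.
    observed_variance = true_variance + error_variance
    return true_variance / observed_variance


# Illustrative values: 80 units of true variance, 20 units of error variance.
print(reliability_coefficient(80.0, 20.0))
```

With these made-up values, 80 of every 100 units of observed variance reflect true differences between test-takers.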

New cards

What is test score variability?

Refers to how spread out test scores are within the group

New cards

What is the difference between systematic and random error?

Random error involves unpredictable fluctuations and lowers reliability; systematic error is a constant, directional bias and lowers validity.

New cards

How is systematic error more of a threat to validity than to reliability?

Because systematic error is constant, it mimics consistency; however, it consistently deviates from the truth.

New cards

What are the sources of error?

Item/content sampling, test administration, test scoring and interpretation

New cards

What is item/content sampling?

The process of selecting a limited set of items to represent a broader domain of knowledge.

New cards

What is the difference between internal and external reliability?

Internal reliability refers to the consistency within the test itself; external reliability refers to the consistency of scores across differing circumstances, times, or raters.

New cards

What are the sources of error in test-retest reliability estimates?

Time-sampling, carryover/practice effects, maturation/reactivity

New cards

Which reliability coefficient is entirely free from content-sampling error?

Test-retest, because the content remains constant across both administrations.

New cards

What is the difference between carryover and practice effects?

Carryover is when emotional or physical states from the first test persist into the second, either increasing or decreasing test scores; practice effects are a type of carryover in which memory or skill gained from the first administration usually increases test scores.

New cards

How do test sophistication and test wiseness differ from carryover and practice effects?

Test sophistication: Familiarity with a specific test or type of test that can boost scores (knowing the format, structure, or typical items).
Test wiseness: General strategies for answering test questions effectively, regardless of content (e.g., eliminating wrong choices, time management).
Carryover effects: When performance on one test influences performance on a later test (e.g., memory of items, lingering effects).
Practice effects: Improved performance simply from repeated exposure to the test, not because of learning the content but because of familiarity with the task.

New cards

What is the difference between parallel and alternate forms?

Parallel forms: mathematically identical (same means, variances, and errors). Alternate forms: closely matched but not mathematically identical.

New cards

How does counterbalancing help with carryover effects?

Counterbalancing helps by neutralizing the order in which conditions are presented so that carryover effects (like fatigue, practice, or boredom) are shared equally across all groups.

New cards

What correlation coefficient do we use for parallel/alternate forms reliability?

Pearson r

New cards

If test-retest = coefficient of stability, what is for parallel/alternate forms?

Coefficient of equivalence

New cards

What is the difference between KR20 and KR21 and when do we use them respectively?

KR-20 is used for dichotomous items with different levels of difficulty.
KR-21 is used for dichotomous items with the same level of difficulty.
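
A rough sketch of both coefficients using their usual textbook formulas, assuming population variance for the total scores (a convention chosen here, not stated in the source). The function names and the response matrix are hypothetical.

```python
from statistics import mean, pvariance


def kr20(responses):
    """KR-20 for 0/1 items; uses each item's own difficulty (p)."""
    k = len(responses[0])                       # number of items
    totals = [sum(row) for row in responses]    # total score per test-taker
    var_total = pvariance(totals)
    sum_pq = 0.0
    for j in range(k):
        p = mean(row[j] for row in responses)   # proportion passing item j
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)


def kr21(responses):
    """KR-21 shortcut; assumes all items share the same difficulty."""
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    m = mean(totals)
    var_total = pvariance(totals)
    return (k / (k - 1)) * (1 - m * (k - m) / (k * var_total))


# Hypothetical data: 5 test-takers x 4 dichotomous items of unequal difficulty.
data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(data), 3), round(kr21(data), 3))
```

Because the items here differ in difficulty, KR-21 comes out lower than KR-20, which is why KR-21 is reserved for items of roughly equal difficulty.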

New cards

What is the difference between Cronbach’s Alpha and McDonald’s Omega?

Cronbach's Alpha: Assumes all items contribute equally to the score (tau-equivalence). McDonald’s Omega: Allows for different item contributions, giving a more flexible measure of consistency.
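
For reference, a minimal sketch of Cronbach's alpha from its standard variance formula. Population variance is an assumption made here for simplicity, and the function name and the data are illustrative. McDonald's omega needs factor loadings from a fitted factor model, so it is not sketched here.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """responses: one row per test-taker, one column per item."""
    k = len(responses[0])                                   # number of items
    item_variances = [
        pvariance([row[j] for row in responses]) for j in range(k)
    ]
    total_variance = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical Likert-type data: 5 test-takers x 3 items.
likert = [
    [3, 4, 3],
    [4, 5, 4],
    [2, 2, 3],
    [5, 5, 5],
    [1, 2, 2],
]
print(round(cronbach_alpha(likert), 3))
```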

New cards

Define tau equivalence and explain what happens if it’s violated according to Cronbach’s Alpha?

Tau-equivalence means that each item on a test contributes equally to the true score, with differences only due to random error. If it is violated, items contribute unequally, and Cronbach's alpha tends to underestimate the test's reliability, so it serves only as a lower-bound estimate.

New cards

What is Average Proportional Difference?

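More commonly called the average proportional distance (APD): a measure of internal consistency based on the average absolute difference between item scores, expressed as a proportion of the response scale so that differences can be compared in a standardized way.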

New cards

What is the difference between Spearman Brown and Rulon’s Formula?

The Spearman-Brown formula is typically used to estimate reliability when the two halves of a test have about equal variance. The Flanagan-Rulon formula is used when the two halves have unequal variances. It adjusts for differences in standard deviations, making it more suitable when the halves differ in variability.
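
The two split-half estimates can be sketched as follows, assuming the usual textbook forms: Spearman-Brown steps up the correlation between the halves, while Rulon works from the variance of the half-score differences. The function names and the half scores are hypothetical.

```python
from statistics import mean, pvariance


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_brown_split_half(half_a, half_b):
    """Step the half-test correlation up to full-test length."""
    r = pearson_r(half_a, half_b)
    return 2 * r / (1 + r)


def rulon_split_half(half_a, half_b):
    """1 minus (variance of differences / variance of total scores)."""
    diffs = [a - b for a, b in zip(half_a, half_b)]
    totals = [a + b for a, b in zip(half_a, half_b)]
    return 1 - pvariance(diffs) / pvariance(totals)


# Hypothetical half scores with equal variances in both halves.
odd_half = [1, 2, 3, 4]
even_half = [2, 1, 4, 3]
print(round(spearman_brown_split_half(odd_half, even_half), 3))
print(round(rulon_split_half(odd_half, even_half), 3))
```

In this example the halves have identical variances, so the two estimates coincide; they diverge as the half variances drift apart.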

New cards

When do we use kappa statistics and Kendall's W?

Kappa is for categorizing nominal data; Kendall's W is for ranking ordinal data.
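
A small sketch of Kendall's W for ranked data, assuming complete rankings with no ties (the tie correction is omitted). The function name and the rankings are illustrative.

```python
def kendalls_w(rankings):
    """rankings: one list of ranks (1..n, no ties) per rater."""
    m = len(rankings)            # number of raters
    n = len(rankings[0])         # number of objects being ranked
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_sum = sum(rank_sums) / n
    # S: squared deviations of the rank sums from their mean
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical rankings: three raters who rank three objects identically.
print(kendalls_w([[1, 2, 3], [1, 2, 3], [1, 2, 3]]))
```

W ranges from 0 (no agreement among raters) to 1 (complete agreement).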

New cards

When do we use Fleiss and Cohen’s Kappa?

Cohen's is used for 2 raters; Fleiss is used for 3 or more raters
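
For the two-rater case, Cohen's kappa can be sketched as observed agreement corrected for the agreement expected by chance. The function name and the ratings are made up for illustration.

```python
from collections import Counter


def cohens_kappa(rater_1, rater_2):
    """Chance-corrected agreement between two raters on nominal categories."""
    n = len(rater_1)
    # Observed agreement: share of cases where both raters chose the same category
    p_observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    counts_1, counts_2 = Counter(rater_1), Counter(rater_2)
    categories = set(rater_1) | set(rater_2)
    # Expected agreement if the raters assigned categories independently
    p_expected = sum(counts_1[c] * counts_2[c] for c in categories) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical yes/no judgments from two raters on four cases.
print(cohens_kappa(["y", "y", "n", "n"], ["y", "n", "n", "n"]))
```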

New cards

What is the difference between criterion- and norm-referenced test?

Norm-referenced test involves comparing an individual's scores against other people's performance; criterion-referenced test involves comparing test scores against a criterion or predetermined standard.

New cards

Differentiate CTT, Domain Sampling, Generalizability, and IRT.

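CTT states that observed scores comprise true scores plus error.
Domain Sampling states that adding items increases reliability by better representing the content domain.
Generalizability theory extends CTT by partitioning error into multiple identifiable sources (facets), such as raters, occasions, and items.
IRT improves on CTT by providing sample-independent item statistics, at the cost of requiring much larger sample sizes.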

New cards

What is the concept of sample-dependency all about?

In CTT, if the sample has high ability, the items appear easy; conversely, if the sample has lower ability, the items appear difficult. Essentially, item statistics are contingent on the specific sample taking the test rather than the item itself.

New cards

How does Domain Sampling relate to CTT?

Domain Sampling explains why CTT rules work: A bigger scoop (more questions) gives a more accurate sample of your knowledge, which naturally cuts out error and boosts reliability.
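
The "bigger scoop" idea is often quantified with the general Spearman-Brown prophecy formula. This is a sketch that assumes the added items are comparable to the existing ones; the function name and the starting reliability are hypothetical.

```python
def spearman_brown_prophecy(reliability, length_factor):
    """Projected reliability when test length is multiplied by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical test with reliability 0.50, lengthened 1x, 2x, and 4x.
for factor in (1, 2, 4):
    print(factor, round(spearman_brown_prophecy(0.50, factor), 3))
```

The projected reliability rises as the test gets longer, but with diminishing returns.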

New cards

What is another term for IRT?

Latent-trait theory

New cards

Why is IRT better than CTT?

Because IRT is sample-independent, an item's parameters (such as difficulty and discrimination) remain invariant regardless of the ability level of the sample taking the test.

New cards

If test-retest = coefficient of stability, what is for interrater, split-half, and inter-item reliability?

Test-retest reliability → Coefficient of Stability (consistency of scores over time).
Interrater reliability & Parallel/Alternate forms reliability → Coefficient of Equivalence (agreement between different raters/observers).
Split-half reliability → Coefficient of Internal Consistency (consistency between two halves of a test).
Inter-item reliability → Coefficient of Homogeneity (consistency among items within the same test).

New cards

What is the equivalent of ±1 and ±2 SEM in confidence interval?

±1 SEM ≈ 68% confidence interval → The true score is likely within 1 SEM above or below the observed score. ±2 SEM ≈ 95% confidence interval → The true score is very likely within 2 SEMs above or below the observed score.
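
A brief sketch of how these bands are built, assuming the common formula SEM = SD × sqrt(1 − reliability). The function names, the SD of 15, and the reliability of 0.91 are illustrative values, not from the source.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)


def confidence_band(observed, sd, reliability, n_sem):
    """Observed score plus or minus n_sem standard errors of measurement."""
    margin = n_sem * standard_error_of_measurement(sd, reliability)
    return observed - margin, observed + margin


# Illustrative values: observed score 100, SD 15, reliability 0.91.
for n_sem in (1, 2):
    low, high = confidence_band(100, 15, 0.91, n_sem)
    print(n_sem, round(low, 1), round(high, 1))
```

With these numbers the SEM is about 4.5, so the roughly 68% band spans about 95.5 to 104.5 and the roughly 95% band spans about 91 to 109.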

New cards

How do we differentiate SEM, SED, SEE?

SEM (Standard Error of Measurement): Focuses on a single score; shows how much error is in one observed score. Linked to reliability.
SED (Standard Error of Difference): Compares two scores; indicates whether the gap between them is real or significant.
SEE (Standard Error of Estimate): Evaluates prediction accuracy; shows the average amount of error in predicted scores. Linked to validity.
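
The other two statistics can be sketched from their usual textbook formulas, assuming SED combines the two scores' SEMs and SEE uses the criterion's SD with the validity coefficient. Function names and input values are hypothetical.

```python
import math


def standard_error_of_difference(sem_1, sem_2):
    """SED = sqrt(SEM1^2 + SEM2^2) for two scores on the same scale."""
    return math.sqrt(sem_1 ** 2 + sem_2 ** 2)


def standard_error_of_estimate(sd_criterion, validity):
    """SEE = SD of the criterion * sqrt(1 - r_xy^2)."""
    return sd_criterion * math.sqrt(1 - validity ** 2)


# Illustrative inputs: SEMs of 3 and 4; criterion SD of 10, validity of 0.60.
print(round(standard_error_of_difference(3, 4), 2))
print(round(standard_error_of_estimate(10, 0.60), 2))
```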

New cards

How do we differentiate reliability and validity?

Reliability: Consistency of measurement. A test is reliable if it produces stable, repeatable results (like a scale giving the same weight each time).
Validity: Accuracy of measurement. A test is valid if it actually measures what it claims to measure (like a scale measuring weight, not height).

New cards

What is the relationship between reliability and validity?

A test can be reliable but not valid (e.g., consistently wrong), but it cannot be valid unless it is reliable.

New cards

What is the concept that refers to a judgment regarding how well the test measures what it purports to measure at the time and place the variable is naturally being emitted?

Ecological validity

New cards

Differentiate external validity and ecological validity.

External Validity: Who, Where, and When (Generalizability). Ecological Validity: The Setting (Naturalness vs. Artificiality)

New cards

What are the components of ecological validity?

Verisimilitude → Appearance: The degree of resemblance between the test situation and real-world conditions. "Looks like real life". Veridicality → Accuracy: The degree to which test scores truly reflect or predict actual functioning in the real world. "Works like real life".

New cards

How can we increase internal and external validity?

Internal validity: use random assignment, standardization, and counterbalancing to control extraneous variables; external validity: use diverse yet intentional participants as well as a naturalistic setting

New cards

How do external, internal, conceptual, and face validity differ from each other?

Internal: confidence that there is a relationship; external: generalizability of the results. Conceptual: involves a theoretical foundation; face: how valid the test appears to the test-taker.

New cards

What is the difference between conceptual and construct validity?

Conceptual is the blueprint on which the test is based. Construct ensures that the construct is being measured accurately.

New cards

Differentiate the trinitarian view of validity.

Content: whether the test covers all important aspects of the topic. Criterion-related: whether test scores relate to other measures (criteria). Construct: whether test results align with or contradict the theory.

New cards

Define construct underrepresentation and construct-irrelevant variance.

Construct underrepresentation: the test lacks components of the construct. Construct-irrelevant variance: the test picks up extraneous constructs, which distorts the accuracy of the scores.

New cards

What are the core principles of content validity?

Representativeness, relevance, absence of bias, technical quality

New cards

How does content validity associate with construct and criterion validity?

Content validity ensures that the test covers all important aspects of the construct, which supports construct validity. With that coverage in place, the results can then be checked against another related measure or a standard, which is the concern of criterion validity.

New cards

What do we use to calculate the content validity of each item according to experts? How is each item rated?

The content validity index (CVI); each expert rates each item's relevance on a 4-point scale.

New cards

What does Zero CVR mean? How about a positive one?

A CVR of 0 means exactly half of the raters judge the item relevant (no consensus); +1 means all raters agree on the item's relevance; -1 means all raters find the item irrelevant.
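
These three landmark values fall out of Lawshe's content validity ratio formula, sketched below. The function and argument names are hypothetical, and the panel size of 10 is made up.

```python
def content_validity_ratio(n_endorsing, n_raters):
    """Lawshe's CVR = (n_e - N/2) / (N/2)."""
    half = n_raters / 2
    return (n_endorsing - half) / half


# Hypothetical panel of 10 raters.
for n_endorsing in (0, 5, 10):
    print(n_endorsing, content_validity_ratio(n_endorsing, 10))
```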

New cards

What is the difference between I-CVI and S-CVI?

I-CVI: focuses on testing individual item's relevance; S-CVI: focuses on providing the summary of all items' relevance
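
A minimal sketch of both indices, assuming the common convention that a rating of 3 or 4 on the 4-point scale counts as "relevant" and that S-CVI is reported either as the average of the I-CVIs or as the share of items with universal agreement. All names and ratings below are illustrative.

```python
def item_cvi(ratings):
    """I-CVI: proportion of experts rating the item 3 or 4."""
    return sum(1 for r in ratings if r >= 3) / len(ratings)


def scale_cvi_average(items):
    """S-CVI/Ave: mean of the I-CVIs across all items."""
    return sum(item_cvi(r) for r in items.values()) / len(items)


def scale_cvi_universal(items):
    """S-CVI/UA: share of items that every expert rated 3 or 4."""
    return sum(1 for r in items.values() if item_cvi(r) == 1.0) / len(items)


# Hypothetical ratings from four experts on three items.
expert_ratings = {
    "item_1": [4, 4, 3, 4],
    "item_2": [4, 2, 3, 3],
    "item_3": [1, 2, 3, 4],
}
for name, ratings in expert_ratings.items():
    print(name, item_cvi(ratings))
print(scale_cvi_average(expert_ratings), round(scale_cvi_universal(expert_ratings), 3))
```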

New cards

What can criterion-related validity be used for?

Can be used to infer an individual's most probable standing on some measure of interest.

New cards

What is the primary difference between concurrent and predictive validity?

Concurrent validity → The degree to which test scores correlate with another measure taken at the same time. Predictive validity → The degree to which test scores forecast or predict performance on a relevant measure taken in the future.

New cards

What does high incremental validity indicate?

Incremental validity: The extent to which a new test or measure adds useful information beyond what existing measures already provide. High incremental validity → Shows that the measure contributes unique, additional predictive value not captured by other tests or assessments.

New cards

What are the core principles of construct validity?

Clear conceptualization, operationalization, empirical evidence

New cards

What are the six (6) evidences of construct validity and how do they differ from each other?

Evidence of homogeneity → Proof that the test measures only one construct (items are internally consistent).
Evidence of changes with age → The construct is expected to change across development, and scores should follow that expected course.
Evidence of pretest-posttest changes → If an intervention targets the construct, test scores should change accordingly.
Known-groups validity → The test should differentiate between groups known to differ on the construct (e.g., clinical vs. non-clinical).
Convergent validity → The test should correlate with other established measures of the same construct.
Discriminant validity → The test should show little to no correlation with measures of different constructs.

New cards

What do we use to evaluate both convergent and divergent validity simultaneously?

Multi-trait Multimethod Matrix

New cards

What else is the primary purpose of the Multi-trait Multimethod Matrix?

For assessing the construct validity of a set of measures in a study

New cards

What is the function of each correlation in the MTMM matrix?

Monomethod-monotrait: reliability diagonal; should be the highest in the entire matrix.
Heteromethod-monotrait: validity diagonal; provides evidence of convergent validity.
Monomethod-heterotrait: should be low as it provides evidence of discriminant validity.
Heteromethod-heterotrait: should be the lowest as it further supports the discriminant validity.

New cards

If the correlations in the heterotrait-monomethod triangle are high, this means that the test has __?

Low discriminant validity; because it's only capturing the test-taker’s ability with the method (e.g., response style, format familiarity) instead of uniquely measuring the intended construct.

New cards

What are latent variables?

Factors that are not directly observed but are inferred from patterns in responses or behaviors.

New cards

How does factor loading work?

A statistical value that shows how strongly a specific item (question) is associated with a latent construct (hidden variable); represents the correlation between the item and the latent construct.

New cards

Higher loadings that are close to __ mean the item is a strong indicator of the construct

New cards

How does factor analysis function?

A statistical technique used to identify latent variables (factors) by examining patterns of correlations among items or questions.

New cards

What are the two types of factor analysis?

Exploratory Factor Analysis (EFA) → Used when we have no preconceived idea of the factor structure. It aims to explore and discover the underlying dimensions. It is theory-generating. Confirmatory Factor Analysis (CFA) → Used when we already have a hypothesized factor structure based on theory or prior research. It aims to test and confirm whether the data fit that structure. It is theory-testing.

New cards

How does Factor Analysis differ from Principal Component Analysis?

Factor Analysis (FA) → Aims to identify latent variables (factors) by examining correlations among items; focuses on uncovering hidden structure. Principal Component Analysis (PCA) → Aims to reduce dimensionality by summarizing items into components that maximize explained variance; involves data simplification.

New cards

What does variance maximization have to do with PCA?

The more variance a component captures, the more information it retains from the original dataset, making summarization more effective.

New cards

PCA forces data into a shape that doesn't overlap as it follows the concept of __?

Orthogonality

New cards

Define eigenvalue. Where does it lie on the scree plot?

Eigenvalue → A numerical value that represents the amount of variance explained by a factor or principal component; it is plotted on the y-axis.

New cards

For every item, the eigenvalue is equivalent to __. Additionally, what should we do with factors that have eigenvalues lower than that?

1.0; factors with eigenvalues lower than 1.0 are discarded.

New cards

Which criterion/rule explains that we should retain factors that have an eigenvalue greater than 1.0?

Kaiser criterion/K1 rule
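
As a small illustration of the rule, assuming the eigenvalues have already been extracted from a correlation matrix of standardized items, so they sum to the number of items. The eigenvalues and function names below are made up.

```python
def kaiser_retain(eigenvalues, cutoff=1.0):
    """Keep only the factors whose eigenvalue exceeds the cutoff."""
    return [e for e in eigenvalues if e > cutoff]


def proportion_of_variance(eigenvalue, n_items):
    """Each standardized item contributes 1.0, so total variance = n_items."""
    return eigenvalue / n_items


# Hypothetical eigenvalues for a 5-item scale (they sum to 5).
eigenvalues = [2.4, 1.3, 0.6, 0.4, 0.3]
retained = kaiser_retain(eigenvalues)
print(len(retained), [round(proportion_of_variance(e, 5), 2) for e in retained])
```

Here two factors clear the cutoff, and together they account for a little under three quarters of the total variance.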

New cards

Why is Kaiser considered not the most accurate method in EFA and PCA?

It tends to overestimate the number of factors, especially when many variables are included.

New cards

What exactly is the limitation of the elbow method?

Elbow point is subjective and therefore ambiguous.

New cards

What is the process of revalidating a test by using a different group and their scores as the criterion?

Cross-validation

New cards

What is the primary goal of cross-validation?

To check whether the test’s validity generalizes beyond the initial sample.

New cards

What is the difference between co-norming and co-validation?

Co-validation → Involves checking whether two tests predict the same criterion. It’s about comparing their validity evidence — do both tests measure what they claim to measure in relation to an external standard? Co-norming → Involves administering two or more tests to the same group of test-takers so their scores can be placed on a common normative scale. It’s about aligning score interpretations across tests.

New cards

What is the difference between co-validation and predictive validity?

Co-validation: comparing two tests against the same criterion. Predictive validity: evaluating one test’s ability to predict future outcomes.

New cards

What is the process of administering a test for the purpose of establishing norms?

Standardization

New cards

What is the process of selecting a portion to represent a population?

Sampling

New cards

What are the three types of sampling method?

Stratified sampling → The population is divided into subgroups (strata) based on characteristics (e.g., age, gender), and samples are taken from each subgroup to ensure representation.
Purposive sampling → The researcher deliberately selects participants who fit specific criteria or purpose (e.g., experts in a field).
Incidental sampling → Also called convenience sampling; participants are chosen simply because they are easy to access (e.g., whoever happens to be available).

New cards

What is the difference between grade and age norms?

Grade norms: grade-level basis; Age norms: age-equivalent scores

New cards

How do national norms differ from national anchor norms?

National norms → Scores are collected from a representative sample of the entire nation. They provide a baseline for interpreting individual test scores relative to the general population. National anchor norms → Created by linking or equating scores from different tests using a common national reference group. They allow scores from different tests to be compared on the same scale.

New cards

What do you call the type of norms that are typically developed by test users?

Local norms

New cards

What do you call the type of norm that can be further segmented by any of the criteria initially used in selecting subjects for the sample?

Subgroup norms

New cards

What is the difference between criterion- and norm-referenced test?

Criterion: compares the scores of the test-taker against a fixed standard. Norm: compares the scores against the scores of other test-takers

New cards

If norm-referenced is all about assessing "Where do I stand compared to others?" and criterion-referenced is all about "How much of the material do I know?", how is fixed-referenced any different?

Compares the test-taker's performance against the scores of past test-takers.

New cards

If the p-value is above 0.10, what does this signify?

No evidence against the null hypothesis

New cards

If the p-value is less than or equal to 0.001, what does this signify?

Extremely strong evidence against the null hypothesis

New cards

If alpha is more than 0.9, what does this signify?

Excellent internal consistency

New cards

If alpha is less than 0.6 but more than 0.50, what does this signify?

Poor internal consistency reliability

New cards

At which point do we tag inter-item consistency as questionable?

0.60 < α < 0.70

New cards

At which point do we tag a validity coefficient as very beneficial?

Above 0.35

New cards

A test's reliability is interpreted as having limited applicability if the coefficient is __.

Below 0.70

New cards

An interrater reliability coefficient ranging from 0.40 to 0.75/0.80 is considered fair to good, according to whose guidelines?

Fleiss, 1981

New cards

Which inter-rater reliability measure mimics Fleiss' Kappa but incorporates predefined benchmarks for qualitative interpretations like 'Good' and 'Fair'?

Cicchetti and Sparrow

New cards

__ reliability coefficient is considered perfect and may indicate redundancy/homogeneity.

0.95 and above

New cards

__ reliability coefficient is considered the minimum for clinical settings.

0.9

New cards

__ reliability coefficient is considered the minimum for psychometric tests.

0.8

New cards

__ reliability coefficient is considered acceptable for research.

0.7

New cards

What is the difference between percentile and percentage correct?

Percentile: standing compared to others (the percentage of test-takers who scored below you). Percentage correct: the share of items you answered correctly.
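
The contrast can be sketched as follows, assuming the "percentage of scores below" definition of percentile rank (other definitions also count half of the tied scores). The function names and the scores are hypothetical.

```python
def percentage_correct(n_correct, n_items):
    """Share of items answered correctly, as a percentage."""
    return 100 * n_correct / n_items


def percentile_rank(score, group_scores):
    """Percentage of scores in the reference group that fall below this score."""
    below = sum(1 for s in group_scores if s < score)
    return 100 * below / len(group_scores)


# Hypothetical 20-item test and a reference group of five test-takers.
group = [10, 12, 15, 18, 20]
print(percentage_correct(18, 20))   # depends only on the test-taker's own answers
print(percentile_rank(18, group))   # depends on how everyone else scored
```

The same raw score of 18 gives 90% correct but only a percentile rank of 60 in this group, which shows that the two numbers answer different questions.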