Session 4: Statistics

Introduction to Statistics in Psychodiagnostics

Statistics help organize, summarize, interpret, and communicate assessment results.
Competence in assessment requires statistical knowledge to understand, interpret, and evaluate test psychometrics (reliability, validity, standardization).

Descriptive vs Inferential Statistics

Descriptive Statistics: Summarize large datasets clearly.
Inferential Statistics: Make inferences about a population from a sample.

Variable

A variable is anything that can take more than one value, e.g., achievement, intelligence.

Measurement

Assigning numbers or symbols to objects, traits, or behaviors according to logical rules.
Example: Customer satisfaction on a 10-point scale.

Scale of Measurement

A scale of measurement categorizes variables by the kind of information their values carry. There are four scales: Nominal, Ordinal, Interval, Ratio.

Scale of Measurement: Key Properties

Defined by three properties:
1. Magnitude: Inherent order (smaller to larger).
2. Equal interval: Equal distance between adjacent points.
3. Absolute/true zero: Zero means absence of the property.

Scale of Measurement: Qualities, Examples

Nominal: No magnitude (e.g., names).
Ordinal: Magnitude present (e.g., rank order, Likert scales).
Interval: Magnitude and equal intervals, but no true zero (e.g., temperature in Celsius or Fahrenheit).
Ratio: Magnitude, equal intervals, and true zero (e.g., age, height, weight).

Describing Scores

Use descriptive statistics to organize scores:
- Frequency distribution: Order scores.
- Measures of central tendency: Typical performance (mean, median, mode).
- Measures of variability: Dispersion of scores (spread).
- Measures of relationship: Degree of relationship between variables.
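
The first three bullets can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the scores are hypothetical, and the helper names `frequency_distribution` and `describe` are made up for this example.

```python
import statistics

# Hypothetical raw scores for ten examinees (illustrative data only).
scores = [12, 15, 15, 17, 18, 18, 18, 20, 22, 25]

def frequency_distribution(values):
    """Return (score, count) pairs ordered from lowest to highest score."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return sorted(counts.items())

def describe(values):
    """Summarize central tendency and variability of a list of scores."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "range": max(values) - min(values),
        "sd": statistics.stdev(values),  # sample standard deviation (n - 1)
    }

for score, count in frequency_distribution(scores):
    print(score, count)

summary = describe(scores)
print(summary["mean"], summary["median"], summary["mode"])  # all three equal 18 here
```

In this made-up distribution the mean, median, and mode coincide at 18, which is what a roughly symmetric distribution tends to look like; in skewed distributions the three measures drift apart.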

Understanding Assessment Scores

Raw scores: Number of correct answers; only meaningful when compared to a standard.
Criterion-referenced scores: Compare individual to a specified performance level.
Norm-referenced scores: Compare individual to a norm group.

Criterion-Referenced Scores

Interpreted in absolute terms (percentages, cutoff scores) to show mastery.
Example: Passing score for a course.

Norm-Referenced Scores: Overview

Compare examinee to a relevant, representative, current, and adequately sized norm group.
Norms should be updated approximately every 10 years.

Types of Norm-Referenced Scores

1) Percentile ranks (PR)

Percentage of a distribution below a particular score.
Describes exact position within the distribution.
Quartiles: divide into 4 equal parts (lower, median, upper quartile).
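
As a rough sketch, the definition above (percentage of the distribution below a score) translates directly into code. The norm-group data and the function name `percentile_rank` are hypothetical; note also that some textbooks count half of the tied scores, which this simple version does not.

```python
def percentile_rank(norm_scores, score):
    """Percentage of scores in the norm group that fall below `score`."""
    if not norm_scores:
        raise ValueError("norm_scores must not be empty")
    below = sum(1 for s in norm_scores if s < score)
    return 100.0 * below / len(norm_scores)

# Hypothetical norm group of ten raw scores (illustrative data only).
norm_group = [12, 15, 15, 17, 18, 18, 18, 20, 22, 25]

print(percentile_rank(norm_group, 20))  # 70.0 (seven of ten scores are below 20)
```

A percentile rank of 70 means the examinee scored higher than 70% of the norm group; it does not mean 70% of the items were answered correctly.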

2) Standard scores

Represent relative position, assuming a normal distribution.
Linear transformations of raw scores, retaining a direct relationship.
Examples: Z scores, T scores, Deviation IQs, CEEB scores, stanines, sten scores.

Standardized Score Examples (Visual Concept)

Z score: Typically -3 to +3 (approx. 68% within ±1 SD, 95% within ±2 SD, 99.7% within ±3 SD).
Other scales: Deviation IQ, T scores, Stanine (9-point), Sten scores, SAT/GRE scales.
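
The scales listed above can be related to one another through the z score. The sketch below assumes the conventional means and standard deviations for each scale (T: 50 and 10, Deviation IQ: 100 and 15, CEEB: 500 and 100); these values are not stated in the notes, and the raw-score mean and SD are hypothetical.

```python
def z_score(raw, mean, sd):
    """Distance of a raw score from the norm-group mean, in SD units."""
    if sd <= 0:
        raise ValueError("sd must be positive")
    return (raw - mean) / sd

def rescale(z, new_mean, new_sd):
    """Linear transformation of a z score onto another standard-score scale."""
    return new_mean + new_sd * z

# Hypothetical test: norm-group mean 28, SD 4, examinee raw score 34.
z = z_score(raw=34, mean=28, sd=4)

print(z)                       # 1.5
print(rescale(z, 50, 10))      # 65.0   T score (assumed mean 50, SD 10)
print(rescale(z, 100, 15))     # 122.5  Deviation IQ (assumed mean 100, SD 15)
print(rescale(z, 500, 100))    # 650.0  CEEB score (assumed mean 500, SD 100)
```

All four numbers describe the same relative position (1.5 SD above the mean); only the unit of the scale changes.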

3) Grade and age equivalents

Grade-equivalent scores: Express performance as the grade level at which the average child obtains that score (norm-referenced).
Age-equivalent scores: Express performance as the age at which the average individual obtains that score.

Reliability

Consistency, dependability, and reproducibility of test scores across items, forms, or repeated administrations.
If a test is reliable, repeated measurements under the same conditions yield identical or nearly identical results.
Core equation: Observed score = True score + Measurement error

Measurement Error

Scores are rarely error-free; error arises from test-taker factors, flawed procedures, or chance.
The true score is the ideal, error-free value; the observed score is what is actually obtained. Greater error means lower reliability.
Relation: Observed score = True score + Measurement error.
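
The claim that greater error means lower reliability can be illustrated with a small simulation. This is a sketch under stated assumptions: it uses the classical test theory definition of reliability as the ratio of true-score variance to observed-score variance (not spelled out in these notes), and the function name, score scale, and SD values are hypothetical.

```python
import random
import statistics

def simulate_reliability(n=10000, true_sd=10.0, error_sd=5.0, seed=0):
    """Estimate reliability as var(true) / var(observed) for simulated examinees.

    Assumes random error with mean zero that is independent of the true score.
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(100.0, true_sd) for _ in range(n)]
    # Observed score = true score + measurement error
    observed = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    return statistics.variance(true_scores) / statistics.variance(observed)

# Theoretical values: 100 / (100 + 25) = 0.80 and 100 / (100 + 100) = 0.50
print(round(simulate_reliability(error_sd=5.0), 2))
print(round(simulate_reliability(error_sd=10.0), 2))
```

With the true-score spread held constant, doubling the error SD drops the estimated reliability from roughly 0.8 to roughly 0.5.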

Sources of Measurement Error

1) Time-sampling error: Fluctuations due to when repeated testing occurs (e.g., practice effects, maturation).
2) Content-sampling error: Error from inadequate item selection to cover content.
3) Interrater differences: Error from subjective scoring judgments; assessed by interrater reliability.
4) Other sources: Item quality, test length, test-taker variables (motivation, fatigue), poor administration, room conditions.
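
For interrater differences (source 3), one way to quantify agreement between two raters is percent agreement, optionally corrected for chance with Cohen's kappa. The notes do not name a specific index, so treat kappa here as one common choice rather than the required method; the ratings and function names are hypothetical.

```python
from collections import Counter

def percent_agreement(rater_a, rater_b):
    """Proportion of cases on which two raters assign the same category."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the two raters' marginal proportions per category
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical ratings of ten responses by two scorers (illustrative data only).
rater_a = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass"]
rater_b = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass", "pass", "pass"]

print(percent_agreement(rater_a, rater_b))       # 0.8
print(round(cohens_kappa(rater_a, rater_b), 2))  # 0.58
```

The raters agree on 8 of 10 cases, but because some agreement would occur by chance alone, kappa is noticeably lower than the raw agreement rate.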

Validity

Refers to whether assessment claims/decisions are sound, meaningful, and useful for the intended purpose.
Degree to which all evidence supports intended interpretation of test scores.

Construct Validity

Latent variables (constructs) cannot be measured directly (e.g., aggression, resilience, depression).
They are inferred from interrelated observable variables/dimensions.

Threats to Validity

Construct underrepresentation: Test too narrow; it misses important aspects of the construct.
Construct-irrelevant variance: Test too broad; it includes variables unrelated to the construct.
Other threats: Ambiguous items, too few items, improper item arrangement, scoring errors, test-taker characteristics (anxiety), inappropriate test groups.

Validity and Reliability Relationship

Reliability is necessary but not sufficient for validity.
Unreliable measures cannot be valid; reliable measures can still be invalid (measuring the wrong construct).
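
In classical test theory this relationship is often written as an inequality. The notation below is an assumption for illustration (the notes do not define symbols): r_xy is the validity coefficient (correlation between test x and criterion y), and r_xx and r_yy are the reliabilities of the test and the criterion.

```latex
r_{xy} \le \sqrt{r_{xx}\, r_{yy}}
% Worked example: with r_{xx} = 0.64 and a perfectly reliable criterion (r_{yy} = 1),
% the validity coefficient cannot exceed \sqrt{0.64 \times 1} = 0.80.
```

The bound is a ceiling only: a test with high reliability can still show a validity coefficient near zero if it measures the wrong construct.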

Types of Validity Evidence

Face Validity: Surface-level appropriateness.
Content Validity: Evidence based on test content's representativeness of the content domain.
Criterion-related Validity: Evidence based on test scores' relationship with external criteria (Concurrent and Predictive validity).
Construct Validity: Evidence based on appropriateness of inferences about a construct (homogeneity, convergent/discriminant validity, group differentiation, factor analysis).
- Convergent validity
  This means your test should be strongly related to other tests that measure the same thing.
  Example: If you create a new questionnaire for social anxiety, it should correlate highly with an established social anxiety scale. That shows both are measuring the same construct.
- Discriminant validity
  This means your test should not be too strongly related to tests that measure different things.
  Example: Your social anxiety test should not correlate strongly with a math ability test. If it did, that would suggest your test isn’t specific and is picking up something unrelated.
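
The two examples above boil down to comparing correlation coefficients. The sketch below computes Pearson's r by hand for made-up scores; all data and variable names are hypothetical and serve only to show the expected pattern (high convergent, low discriminant correlation).

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores for six examinees (illustrative data only).
new_social_anxiety = [10, 14, 18, 22, 26, 30]
established_social_anxiety = [12, 15, 17, 24, 25, 31]
math_ability = [55, 48, 60, 50, 58, 52]

convergent = pearson_r(new_social_anxiety, established_social_anxiety)
discriminant = pearson_r(new_social_anxiety, math_ability)

print(round(convergent, 2))    # 0.98: strong relation to the same construct
print(round(discriminant, 2))  # 0.06: weak relation to a different construct
```

With real data, a sample this small would be far too limited to support conclusions; the point is only the contrast between the two coefficients.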

Important Practical Takeaways

Reliability sets the upper limit on validity.
Validity depends on the intended score interpretation.
Consider norm group representativeness when interpreting scores.
Be mindful of measurement error sources biasing conclusions.

Common Examples and Terms Mentioned in the Transcript

Reliability examples: Identical/nearly identical measurements.
Validity types: Face, Content, Criterion-related (predictive & concurrent), Construct (convergent & discriminant).
Examples of tests: TAT & CAT, WAIS & WISC.
Normed score types: Percentile ranks, Z/T/Deviation IQ, Stanine, Sten; Grade and Age Equivalents.

Key Formulas and Notations

Observed score relation to true score and error:
- Observed score = True score + Measurement error
Standard score via linear transformation:
- Z = (X - mean(X)) / sd(X)