Measurement Scales, Latent Variables, and Statistical Inference – Study Notes
2.2.3 Algebraic Properties of the Scales
Different scales permit different valid algebraic operations.
Stevens’s four classic scales are discussed first, followed by embedding summative response scales within this framework.
2.2.4 Qualitative Versus Quantitative Measurement
Two broad categories of measurement scales: qualitative (categorical) and quantitative.
Qualitative measures characterize what is observed (nominal scale). Synonyms include:
Categorical variables
Nonmetric variables
Dichotomous variables (when there are two values/categories)
Grouped variables
Classification variables
Quantitative measurement is more restrictive in that its values support arithmetic: it is meaningful to compute a mean and standard deviation.
Ordinal scales presuppose an underlying quantitative dimension, but not every ordinal variable yields meaningful means.
Quantitative labels include:
Continuous variables (though many quantitative variables are actually discrete)
Metric variables
Ungrouped variables
2.2.5 Criticisms of Stevens's Schema
Stevens’s scale types are widely used but not without critique.
Other classification systems exist (e.g., Mosteller & Tukey, 1977).
Velleman & Wilkinson (1993) discuss critiques of Stevens’s notions.
Arguments that the mathematical operations you can perform depend more on the research questions than on the exact level of the scale (e.g., Guttman, 1977; Lord, 1953a).
The presented framework is a useful starting point for understanding scale types and implications for data analysis, but it is not dogma.
2.3 Independent Variables, Dependent Variables, and Covariates
Variables are central to research design, measurement, and statistics; roles can vary by analysis.
A variable can have multiple roles in different analyses (e.g., a mediator can be both dependent and independent in different parts).
2.3.1 Independent Variables
In a prototypical experimental study, the independent variable represents the manipulation by the researchers (the treatment effect) contrasted with a control.
Features:
May have two levels (e.g., control vs. experimental) or more (e.g., control, placebo, experimental).
Regardless of its number of levels, an independent variable represents a single entity or continuum.
In experiments, often based on qualitative measurement; in prediction (regression), usually quantitative.
In regression analyses, predictors are considered independent variables.
2.3.2 Dependent Variables
In experiments, the dependent variable is the outcome measured by the researchers.
In correlation/prediction designs, all measures can be viewed as dependent because there is no active manipulation.
Dependent variables can be assessed on any scale, but in the book’s designs they are almost always on quantitative scales.
In ANOVA, the dependent variable’s variance is explained by the independent variables; e.g., the statement "The mean for females was 3.52" refers to the dependent variable in that group.
In multiple regression, the criterion variable is the dependent variable.
2.3.3 Covariates
A covariate is a variable that correlates with a dependent variable and can influence observed relationships if not accounted for.
Covariates can mediate or confound relationships between independent and dependent variables.
Classic example: ice cream sales and crime rate both correlate with temperature; temperature is the third variable linking them.
If temperature is not accounted for, it acts as a confound.
If temperature is included as a covariate or mediator, the observed association between ice cream sales and crime rate can weaken after accounting for temperature.
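The weakening described above can be sketched with the standard first-order partial correlation formula. The three correlation values below are hypothetical, chosen only to illustrate the pattern:

```python
from math import sqrt

def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Hypothetical values: sales-crime r = .60, sales-temp r = .80, crime-temp r = .70
r = partial_r(0.60, 0.80, 0.70)
print(round(r, 3))  # 0.093 — the .60 association nearly vanishes
```

Once temperature is partialed out, almost none of the original sales-crime association remains.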
In ANOVA contexts, including a covariate leads to ANCOVA (analysis of covariance), e.g., controlling for verbal ability when predicting math problem solving.
ANCOVA allows statistical control for a variable not experimentally controlled.
2.4 Between Subjects and Within Subjects Independent Variables
2.4.1 Between Subjects Variables
Between-subjects variable levels comprise separate groups (e.g., girls vs. boys; different diagnostic groups).
Scores in different groups are independent of each other.
2.4.2 Within Subjects Variables
Within-subjects variables include measurements on the same cases across conditions or times.
Example: pretest and posttest measurements on the same individuals.
A within-subjects variable is also called a repeated-measures variable because the same cases are measured repeatedly; their scores across conditions are therefore correlated.
2.5 Latent Variables and Measured Variables
2.5.1 Latent Variables
Latent variables are constructs identified in theory and not directly measured; they are assessed indirectly.
Historical illustration: manifest content vs. latent meaning (Freud, Tolman).
Examples: learning, motivation, job satisfaction, attitude toward life, ethnic identification.
Latent variables are central to many theories of human behavior.
2.5.2 Measured Variables
Measured variables are those for which we have actual data.
Examples: inventory item responses, choices, time spent in a behavior, gender indicator.
Also known as:
Manifest variables
Indicator variables
Observed variables
Measured variables serve as proxies or indicators for latent variables.
2.5.3 Linking Latent Variables to Measured Variables
In many multivariate designs, latent variables are posited and measured via indicators.
Example: the broad construct of achievement can be indicated by GPA; GPA is a quantitative indicator of a latent construct (achievement).
GPA is not a perfect indicator; multiple indicators may be warranted to better estimate the latent construct.
Latent constructs can be composed of measured variables and/or other latent variables.
Multivariate procedures help determine how measured variables should be weighted to form a latent construct (e.g., factor analysis, which identifies shared themes).
2.5.4 Variates as Latent Variables
Latent variables can be imagined as weighted combinations of multiple measured variables (variates).
Example: Coopersmith Self-Esteem Inventory (25 items, scored 1 for endorsement and 0 otherwise; a self-esteem score is formed by summing item scores and multiplying by 4).
The resulting self-esteem score is a variate: a latent construct formed from measured indicators.
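The Coopersmith scoring rule described above (equal weights of 1 on each item, then a constant multiplier) can be sketched as follows; the response pattern is hypothetical:

```python
def coopersmith_score(item_responses):
    """Sum 25 items scored 1 (endorsed) / 0, then multiply by 4 (0-100 scale)."""
    assert len(item_responses) == 25
    return 4 * sum(item_responses)

responses = [1, 0, 1] + [1] * 22  # hypothetical response pattern
print(coopersmith_score(responses))  # 96
```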
Variates can combine measures from different sources (e.g., family history, symptoms, prior GPA for a willingness-to-seek-counseling construct).
Multivariate procedures (e.g., confirmatory factor analysis, SEM) relate latent variables to indicators and relate latent variables to each other.
2.6 Endogenous and Exogenous Variables
In path analysis and structural equation modeling, a model specifies how variables relate to each other.
Endogenous variables are explained or predicted by other variables in the model.
Exogenous variables have no predictors in the model and act as first causes to explain endogenous variables.
Example: sex and ethnic identification as exogenous variables explaining adherence to medication regimens; compliance is endogenous.
2.7 Statistical Significance
Statistical significance tests judge how likely it is that an observed outcome (e.g., a correlation or F value) would occur by chance if the population has no true effect.
General concept: test statistic is compared to what would be expected under the null hypothesis.
2.7.1 Degrees of Freedom
Degrees of freedom count how many values in a set can vary given constraints.
Example: for a set of five numbers with a fixed mean, four numbers can vary; the last is determined by the fixed mean (4 degrees of freedom).
Commonly, df = N - 1 for many statistics; for Pearson r, df = N - 2 because r is a standardized regression coefficient, and a regression line is fixed by two points.
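The five-numbers example can be made concrete: once the mean is fixed, choosing four values forces the fifth.

```python
# With the mean fixed, the last value in a set is determined by the others:
# four of five numbers vary freely, so df = 4.
fixed_mean = 10.0
free_values = [7.0, 12.0, 9.0, 14.0]            # chosen freely
last_value = 5 * fixed_mean - sum(free_values)  # forced by the fixed mean
print(last_value)  # 8.0
```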
2.7.2 Sampling Distributions
A sampling distribution shows the distribution of a statistic across repeated samples from the population.
For a true population correlation of 0, the sampling distribution is centered at 0 and symmetric; most sample correlations cluster around 0 with fewer extreme values.
For a true population correlation of 0.90, the distribution peaks near 0.90 and is highly constrained on the upper end (since r ≤ 1). The distribution becomes skewed as the true parameter approaches its bounds.
The shape depends on the true parameter value and sample size.
2.7.3 The Role of Sample Size
Larger samples yield sampling distributions that are tighter around the true parameter (smaller standard error), making it harder to obtain large deviations by chance.
Smaller samples yield more variability; large correlations can occur by chance with small N.
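The effect of N on the sampling distribution of r can be shown with a small Monte Carlo sketch (pure Python, seeded for reproducibility; true rho = 0 because x and y are drawn independently):

```python
import random
from math import sqrt

random.seed(0)

def pearson_r(xs, ys):
    """Sample Pearson correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def simulate_rs(n, trials=2000):
    # Draw x and y independently (true rho = 0); collect sample r values.
    return [pearson_r([random.gauss(0, 1) for _ in range(n)],
                      [random.gauss(0, 1) for _ in range(n)])
            for _ in range(trials)]

def spread(rs):
    m = sum(rs) / len(rs)
    return sqrt(sum((r - m) ** 2 for r in rs) / len(rs))

small, large = simulate_rs(10), simulate_rs(100)
print(spread(small) > spread(large))  # True: larger N gives a tighter distribution
```

With N = 10, sample correlations of |r| > .5 occur fairly often by chance; with N = 100 they are rare.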
2.7.4 Determination of Significance
If the sampling distribution is normal, 95% of area lies within ±1.96 standard deviation units from the mean. Values beyond this region are considered statistically significant at alpha = 0.05.
Some statistics have non-normal sampling distributions (e.g., t, F, noncentral distributions). Modern software provides exact p-values based on the appropriate distribution.
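The 95%-within-±1.96 figure can be verified directly from the standard normal CDF, which the Python standard library supports via the error function:

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Area within ±1.96 SD of the mean is about 0.95
area = normal_cdf(1.96) - normal_cdf(-1.96)
print(round(area, 4))  # 0.95
```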
For Pearson r, the test statistic t is defined as
t = \frac{r \sqrt{N-2}}{\sqrt{1-r^2}}
and is compared to the t distribution with df = N - 2. When the hypothesized population effect size is not zero, transformations (e.g., Fisher's z') are used instead to assess significance.
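The t formula for testing r against zero translates directly into code; the r and N values below are hypothetical:

```python
from math import sqrt

def t_from_r(r, n):
    """t statistic for testing H0: rho = 0, compared to t with df = n - 2."""
    return r * sqrt(n - 2) / sqrt(1 - r**2)

t = t_from_r(0.50, 27)  # hypothetical r = .50 from N = 27, so df = 25
print(round(t, 3))      # 2.887
```

Here t = 2.887 exceeds the two-tailed .05 critical value for 25 df (about 2.06), so this r would be declared significant.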
2.7.5 Levels of Significance
A pre-set alpha level (e.g., \alpha = 0.05) is chosen before analysis.
Significance is a yes/no decision: a result is either statistically significant or not; terms like "highly significant" are discouraged.
Strength of an observed effect is better assessed via effect size indices or variance explained rather than significance alone.
2.7.6 Statistical Significance Versus Confidence Interval Estimation
Confidence intervals offer a range of plausible values for a parameter, conveying precision and practical significance.
Example: reporting that a one-year survival rate increased by 20 percentage points with a 95% confidence interval of 15 to 24 percentage points is often more informative than just a p-value.
Confidence intervals complement NHST and help avoid over-interpretation of dichotomous results.
2.7.7 Null Hypothesis
NHST assumes the null hypothesis is true; the p-value indicates how often you would observe the statistic or more extreme if the null is true.
Rejecting the null suggests the observed effect is unlikely under the null, but not necessarily practically important.
The probability of rejecting a true null (Type I error) equals the alpha level; real effects can also be missed (Type II error) because of sampling variability.
Thompson (1994) notes that the null is rarely literally true in the population (sample means virtually always differ at least trivially), so with a large enough sample almost any test reaches significance, complicating interpretation.
2.7.8 Type I and Type II Errors
Type I error: incorrectly rejecting the null hypothesis (false positive). Probability equals the alpha level.
Type II error: failing to detect a true effect (false negative). Probability is beta; power = 1 - \beta.
Alpha and power are linked; stricter alpha reduces Type I error but increases Type II error, and vice versa.
Practical considerations (e.g., severity of consequences) influence the choice of alpha.
2.7.9 The Current Status of Statistical Significance Testing
There is growing concern about overreliance on NHST; emphasis is shifting toward replication, effect sizes, and confidence intervals.
The APA (2010) encourages reporting exact p-values and including effect sizes and confidence intervals in reports.
NHST is not discarded but should be integrated with estimation approaches for comprehensive interpretation.
2.8 Statistical Power
Power is the probability of correctly rejecting a false null hypothesis (i.e., detecting a true effect).
Power is influenced by alpha, effect size, and sample size.
2.8.1 Definition of Power
Power = 1 - \beta, where beta is the probability of a Type II error.
Three main factors affect power: alpha level, effect size, and sample size.
2.8.2 Alpha Level
The alpha level controls the risk of a Type I error: e.g., \alpha = 0.05 means a 5% risk.
Increasing stringency (e.g., \alpha = 0.01) reduces Type I error but can increase Type II error (lower power).
In some contexts (e.g., medical research), extremely small alphas (e.g., 0.001) may be warranted to avoid dangerous false positives.
Conversely, higher alpha (e.g., 0.10 or 0.15) can increase power when the consequences of Type I error are not severe.
Example scenario from Pituch & Stevens (2016) discusses trade-offs between Type I and Type II errors depending on consequences.
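The alpha-power trade-off can be illustrated with a small seeded simulation. This is a sketch using a one-sample z test with known sigma = 1 and a hypothetical effect size of 0.5; the critical values 1.96 and 2.576 correspond to two-tailed alphas of .05 and .01:

```python
import random
from math import sqrt

random.seed(1)

def power_sim(effect, n, z_crit, trials=2000):
    """Estimate power of a two-tailed one-sample z test (H0: mu = 0, sigma = 1)."""
    hits = 0
    for _ in range(trials):
        sample_mean = sum(random.gauss(effect, 1) for _ in range(n)) / n
        z = sample_mean * sqrt(n)
        if abs(z) > z_crit:
            hits += 1
    return hits / trials

# Same effect (0.5) and N (20); a stricter alpha lowers power:
p05 = power_sim(0.5, 20, 1.96)   # alpha = .05
p01 = power_sim(0.5, 20, 2.576)  # alpha = .01
print(p05 > p01)  # True
```

Tightening alpha from .05 to .01 noticeably reduces the proportion of simulated studies that detect the (real) effect.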
2.8.3 Effect Size
Effect size reflects the strength or magnitude of an effect and is linked to power: larger effects are easier to detect.
In the correlation context, effect size refers to the strength of association; common indices include r and eta squared (\eta^2).
Three widely used effect-size indices:
Cohen's d (for mean differences)
Hedges' g (bias-corrected version of d for small samples)
Glass' delta (uses control SD in the denominator)
They are calculated from the difference between group means and a variance term:
Cohen's d:
d = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{SD_1^2 + SD_2^2}{2}}}
Hedges' g:
g = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{(n_1 - 1) SD_1^2 + (n_2 - 1) SD_2^2}{n_1 + n_2 - 2}}}
Glass's delta:
\Delta = \frac{\bar{X}_1 - \bar{X}_2}{SD_{\text{control}}} (the control group's SD is used in the denominator)
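The three indices translate directly into code. The summary statistics below are hypothetical, chosen to show that the indices disagree when the SDs and group sizes differ:

```python
from math import sqrt

def cohens_d(m1, m2, sd1, sd2):
    """Cohen's d: mean difference over the root-mean-square of the two SDs."""
    return (m1 - m2) / sqrt((sd1**2 + sd2**2) / 2)

def hedges_g(m1, m2, sd1, sd2, n1, n2):
    """Hedges' g: mean difference over the pooled (n-weighted) SD."""
    pooled = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled

def glass_delta(m1, m2, sd_control):
    """Glass's delta: mean difference over the control group's SD."""
    return (m1 - m2) / sd_control

# Hypothetical summary statistics (treatment M=10, SD=2, n=10; control M=8, SD=3, n=30):
print(round(cohens_d(10, 8, 2, 3), 3))          # 0.784
print(round(hedges_g(10, 8, 2, 3, 10, 30), 3))  # 0.715
print(round(glass_delta(10, 8, 3), 3))          # 0.667
```

With equal group sizes, d and g coincide; they diverge here because the pooled SD weights the larger control group more heavily.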
Cohen provides general guidelines for interpreting effect sizes (without context):
For single-sample t, d: small ≈ 0.20, medium ≈ 0.50, large ≈ 0.80.
For Pearson r: small ≈ 0.10, medium ≈ 0.30, large ≈ 0.50.
For eta-squared (\eta^2) in ANOVA: small ≈ 0.01, medium ≈ 0.06, large ≈ 0.14.
Figure reference: shows the calculation procedures for d, g, and delta.
In broader modeling contexts (e.g., SEM, CFA), indices such as RMSEA, GFI, and TLI play an analogous summary role, describing how consistent a model is with the data (they index model fit rather than effect size per se).
2.8.4 Sample Size
Larger sample sizes generally increase power by reducing standard errors and narrowing confidence intervals.
Degrees of freedom are tied to sample size; larger df means smaller required test statistic values to achieve significance.
Researchers can perform power analyses in advance to determine needed sample size to achieve desired power.
Caution: very large samples can yield statistically significant results for trivial effects; effect-size reporting helps maintain perspective on practical significance.
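The large-N caution can be made concrete using the t-from-r formula from Section 2.7.4; the values are hypothetical:

```python
from math import sqrt

def t_from_r(r, n):
    """t statistic for testing H0: rho = 0 (df = n - 2)."""
    return r * sqrt(n - 2) / sqrt(1 - r**2)

# A trivial correlation (r = .05, about 0.25% of shared variance) still
# exceeds the two-tailed .05 critical value (~1.96) when N = 2000:
t = t_from_r(0.05, 2000)
print(round(t, 2))  # 2.24
```

The result is "statistically significant" even though the effect explains a quarter of one percent of the variance, which is why effect sizes should accompany p-values.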