
Part 2: Data preparation, data exploration, cleaning, and measurement theory

  • Recap from Part 1: raw data from online surveys (e.g., Qualtrics) arrives in a spreadsheet and needs cleaning before analysis.
  • Today’s focus: a theoretical view of what scales measure in surveys and how that informs data processing and interpretation.
  • Key idea: many psychological constructs are latent variables; they cannot be observed directly and must be inferred from responses to survey items.
  • Latent variables vs explicit variables:
    • Explicit variables (e.g., height, weight) can be measured directly without questions.
    • Latent constructs (e.g., well-being, attitude, stress, intelligence, personality) require a multi-question survey to approximate the construct.
  • Why multi-item scales?
    • A single item is rarely sufficient to capture a complex latent construct.
    • Multiple items are combined to form a more reliable measure of the underlying construct.
  • Distinction between scale and questionnaire:
    • A scale is a validated instrument with theoretical justification for combining items into a single score.
    • A questionnaire is not necessarily validated; it may be researcher-made.
  • Unidimensional vs multidimensional scales:
    • Unidimensional: all items measure one single construct (no sub-dimensions).
    • Multidimensional: items represent distinct domains within a broader construct (sub-dimensions or factors).
  • Examples of constructs and scales:
    • Academic major satisfaction: unidimensional scale example; typically uses a small number of items (e.g., 6) to yield a single composite score.
    • Meaning of life: multidimensional scale example with two dimensions:
    • Presence of meaning (items 1, 4, 5, 6, 9; item 9 is reverse-coded, see the reverse-coding sketch after this list)
    • Search for meaning (items 2, 3, 7, 8, 10)
    • Can compute total meaning-of-life score (sum of all 10 items) or separate scores for each dimension, or both.
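
Reverse-coding typically follows the rule new = (scale minimum + scale maximum) − old. A minimal Python (pandas) sketch, assuming the meaning-of-life items are scored 1–7 and stored in columns named mlq1…mlq10 (hypothetical names; adjust the range to the actual instrument):

```python
import pandas as pd

# Toy responses for the ten items (hypothetical column names mlq1..mlq10)
df = pd.DataFrame([[5, 3, 4, 6, 5, 6, 2, 3, 2, 4]],
                  columns=[f"mlq{i}" for i in range(1, 11)])

SCALE_MIN, SCALE_MAX = 1, 7                            # assumed response range; adjust if different
df["mlq9_r"] = (SCALE_MIN + SCALE_MAX) - df["mlq9"]    # reverse-code item 9

# Dimension scores: the presence dimension uses the reverse-coded item 9
presence = ["mlq1", "mlq4", "mlq5", "mlq6", "mlq9_r"]
search = ["mlq2", "mlq3", "mlq7", "mlq8", "mlq10"]
df["presence"] = df[presence].sum(axis=1)
df["search"] = df[search].sum(axis=1)
df["meaning_total"] = df["presence"] + df["search"]    # total score uses the reverse-coded item
print(df[["presence", "search", "meaning_total"]])
```
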
  • Practical characteristics of scales:
    • Scales can be short or long (e.g., 40–50 items originally, later shortened to ~7 items).
    • Longer surveys increase participant fatigue and the risk of missing data; shorter scales that capture the same information are preferred when they are validated.
    • Regardless of scale type, you typically create a composite score by aggregating item responses, not analyzing items individually.
  • Composite scores: why and how
    • Composite score = a total or a mean score across items measuring the construct.
    • Example: Life satisfaction scale with 5 items (each scored 1–6) -> total score range: 5–30; mean score range: 1–6 (see the worked example after this list).
    • Composite score types:
    • Sum (total) score: S = x_1 + x_2 + \dots + x_k
    • Mean (average) score: \bar{X} = \frac{1}{k} \sum_{i=1}^{k} x_i
    • Guidelines:
    • Follow the original instrument’s scoring method when possible (sum or mean).
    • Use the method used by the instrument’s authors to maintain comparability with prior research.
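
Applied to the five-item life-satisfaction example (using the same responses that reappear in the verification step below: 3, 6, 5, 5, 3):

    S = 3 + 6 + 5 + 5 + 3 = 22, \qquad \bar{X} = \frac{22}{5} = 4.4

Both values fall inside the expected ranges (5–30 for the sum, 1–6 for the mean).
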
  • How to compute composite scores (practical SPSS overview):
    • Use SPSS Compute Variable to create a composite score via a numeric expression (a code sketch mirroring these steps follows this list).
    • Example: Sum of five life-satisfaction items:
    • Expression: x1 + x2 + x3 + x4 + x5
    • Result creates a new column (e.g., lifesatisfactiontotal) at the end of the dataset.
    • Rename and reposition the new column to align with the original item columns for easy checking.
    • Double-check calculations by verifying that item sums match the new total (e.g., 3 + 6 + 5 + 5 + 3 = 22).
    • Alternatively, create a mean score: mean(x1, x2, x3, x4, x5) to yield a final score in the 1–6 range (if items are 1–6).
    • Always ensure the calculated composite makes sense and aligns with the original scoring method.
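
For readers working outside SPSS, the same computation can be mirrored in code; a minimal Python (pandas) sketch, assuming five item columns named x1…x5 (the names from the expression above) scored 1–6:

```python
import pandas as pd

items = ["x1", "x2", "x3", "x4", "x5"]
# Toy data: one respondent matching the verification example (3 + 6 + 5 + 5 + 3 = 22)
df = pd.DataFrame([[3, 6, 5, 5, 3]], columns=items)

# Sum (total) composite, analogous to the Compute Variable expression x1 + x2 + x3 + x4 + x5
df["lifesatisfactiontotal"] = df[items].sum(axis=1)

# Mean composite, which stays on the original 1-6 response metric
df["lifesatisfactionmean"] = df[items].mean(axis=1)

print(df)   # total = 22, mean = 4.4
```

Note that pandas ignores missing values by default when summing or averaging across columns, so decide explicitly how incomplete responses should be handled before scoring.
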
  • Practical notes on data handling:
    • After creating the composite, save the dataset and then proceed to final analyses on the composite score rather than the raw items.
    • The process reduces complexity and improves reliability for downstream analyses.
    • Fatigue and missing data remain important concerns; shorter, well-validated scales help mitigate this.
  • Transition to data quality: assess whether the data are acceptable for analysis
    • Once you have a complete dataset with composite scores, evaluate distribution shapes to see if they comply with assumptions of subsequent analyses.
    • The shape of the distribution (normality) is central to many statistical tests (parametric tests assume normality).
  • Normal distribution and the Central Limit Theorem (CLT)
    • CLT intuition: if you repeatedly draw random samples of a given size from a population and compute their means, the distribution of those sample means approaches a normal distribution as the sample size increases.
    • Formal statement (conceptual): the distribution of sample means tends toward normality as sample size increases, regardless of the shape of the population distribution.
    • In practice: with sample sizes above roughly 30, the approximation is usually good enough to support parametric analyses (a small simulation sketch follows this list).
    • Important caveat: CLT does not absolve you from checking the distribution of your data; outliers and non-normality can still affect analyses, especially with smaller samples or non-parametric contexts.
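
A short simulation makes the CLT intuition concrete; a sketch in Python (NumPy), using a deliberately skewed exponential population purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Skewed "population": exponential distribution with mean 2
for n in (5, 30, 200):                                     # sample sizes
    samples = rng.exponential(scale=2.0, size=(10_000, n))
    sample_means = samples.mean(axis=1)                     # 10,000 sample means
    centered = sample_means - sample_means.mean()
    skewness = (centered**3).mean() / centered.std()**3     # ~0 for a normal distribution
    print(f"n={n:3d}  mean of means={sample_means.mean():.2f}  skewness={skewness:.2f}")
```

As n increases, the skewness of the distribution of sample means shrinks toward zero, i.e., the distribution of means looks increasingly normal even though the population itself is far from normal.
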
  • Outliers and their impact on normality and inference
    • Outliers are individual scores that lie far from the rest and can distort the normal shape of the distribution.
    • Effects of outliers:
    • Inflate standard errors and widen confidence intervals around a parameter estimate (e.g., the mean).
    • Potentially bias significance tests and their conclusions.
    • Common sources of outliers:
    • Belonging to a different group than the rest of the sample (real but different population),
    • Poor respondent (inattention, random answering),
    • Data entry errors,
    • Instrument glitches or survey issues (technical problems in data collection).
    • Identifying outliers:
    • Graphical: Explore -> Plots -> Histogram to visually inspect distributions and spot extreme values (e.g., life satisfaction total with a cluster of very low scores).
    • Statistical: Save standardized values as variables to obtain z-scores for each observation.
      • Rule of thumb: about 95% of observations should fall within -1.96 \le z \le 1.96.
      • More extreme cutoffs include |z| > 2.58 and |z| > 3.29.
      • In the example, 14 observations exceeded ±1.96, but none exceeded ±2.58, so no cases counted as outliers under the stricter criteria in that dataset.
    • Handling outliers (common approach):
    • If an observation exceeds ±3.29, consider adjusting the outlier score instead of deleting it outright.
    • A typical method: set the outlier value to one unit higher than the next highest non-outlier value. For example, if 48 is an outlier and the next highest non-outlier is 25, adjust the outlier to 26.
    • Rationale: preserves sample size and normality for analysis while recognizing the data point may reflect measurement issues or extreme cases.
    • Cautions: this adjustment (a form of winsorizing) can distort the true data-generating process if the outlier is a legitimate extreme case; weigh the context and justify the decision before applying it.
    • Practical decision-making about outliers:
    • Investigate the number and patterns of outliers before removing them.
    • If they reflect systematic issues, address root causes (data collection, instrument problems).
    • If they reflect real variation, consider robust statistics or non-parametric analyses as alternatives (a z-score check and adjustment sketch follows this list).
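
A minimal Python (pandas) sketch of the z-score check and the adjustment rule described above, using hypothetical composite scores (in SPSS, the saved standardized variable plays the role of z here):

```python
import numpy as np
import pandas as pd

# Hypothetical composite scores with one suspicious extreme case (48), echoing the example above
rng = np.random.default_rng(1)
scores = np.append(rng.normal(loc=20, scale=2.5, size=60).round(), 48)
df = pd.DataFrame({"total": scores})

z = (df["total"] - df["total"].mean()) / df["total"].std()   # sample z-scores

# Count cases beyond the usual cutoffs
for cutoff in (1.96, 2.58, 3.29):
    print(f"|z| > {cutoff}: {int((z.abs() > cutoff).sum())} case(s)")

# Adjustment: replace values beyond |z| > 3.29 with one unit above
# the next highest non-flagged score (e.g., 48 becomes 26 if that score is 25)
flagged = z.abs() > 3.29
if flagged.any():
    next_highest = df.loc[~flagged, "total"].max()
    df.loc[flagged, "total"] = next_highest + 1
print(df["total"].max())   # the extreme case is pulled in toward the rest of the distribution
```

Whether to adjust, transform, or keep such a case is a judgment call; document the decision and the rationale either way.
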
  • SPSS practical workflow for outliers and distribution checks (warm-up overview):
    • Use Explore -> Plots -> Histogram to view raw data distribution for a variable (e.g., life satisfaction total).
    • Use Descriptives (Analyze -> Descriptive Statistics -> Descriptives) with 'Save standardized values as variables' checked to obtain z-scores for the data and identify potential outliers.
    • Interpret z-score thresholds in the context of your data and the study design.
    • Decide on a plan for handling outliers (investigate, impute, transform, or exclude) and document the rationale.
  • Summary of the workflow and key takeaways
    • Begin with a solid understanding of what you are measuring: latent constructs require scales and composite scoring.
    • Prefer validated scales; follow original scoring instructions (sum vs mean; reverse-coded items).
    • Create composite scores to represent constructs and align with prior literature for comparability.
    • Check data quality: normality, outliers, and distribution shapes, guided by the central limit theorem and diagnostic plots.
    • Use SPSS (Compute Variable, Explore, Plots) to implement scoring and outlier checks; document every step for transparency and reproducibility.
  • Final practical reminder:
    • Double-check all composite calculations by verifying that the total or mean aligns with the underlying item scores.
    • Consider the trade-offs between data integrity and statistical convenience when deciding how to handle outliers or non-normal data.
    • The approaches discussed here (composite scoring, normality assessment, and outlier management) will recur in later units as you advance in data analysis.

Key formulas and concepts

  • Composite score (sum):
    S = \sum_{i=1}^{k} x_i
  • Composite score (mean):
    \bar{X} = \frac{1}{k} \sum_{i=1}^{k} x_i
  • Z-score (sample):
    z_i = \frac{x_i - \bar{X}}{s}
  • Normal distribution and CLT (conceptual): the distribution of sample means tends toward a normal distribution as the sample size grows; with large samples (commonly n > 30) the approximation is typically robust.
  • Outlier thresholds (typical references):
    • Within ±1.96: about 95% of cases expected
    • Beyond ±2.58: roughly 1% of cases expected; extreme, but not necessarily outliers
    • Beyond ±3.29: roughly 0.1% of cases expected; typically treated as outliers
  • Data handling goals:
    • Create usable, comparably scored composites
    • Preserve data integrity while ensuring normality assumptions are reasonable for subsequent analyses
    • Document all decisions about scoring and outlier treatment for transparency