Part 2: Data preparation, data exploration, cleaning, and measurement theory
- Recap from Part 1: raw data from online surveys (e.g., Qualtrics) arrives in a spreadsheet; needs cleaning before analysis.
- Today’s focus: a theoretical view of what scales measure in surveys and how that informs data processing and interpretation.
- Key idea: many psychological constructs are latent variables; you cannot observe them directly through a single item.
- Latent variables vs explicit variables:
- Explicit variables (e.g., height, weight) can be measured directly without questions.
- Latent constructs (e.g., well-being, attitude, stress, intelligence, personality) require a multi-question survey to approximate the construct.
- Why multi-item scales?
- A single item is rarely sufficient to capture a complex latent construct.
- Multiple items are combined to form a more reliable measure of the underlying construct.
- Distinction between scale and questionnaire:
- A scale is a validated instrument with theoretical justification for combining items into a single score.
- A questionnaire is not necessarily validated; it may be researcher-made.
- Unidimensional vs multidimensional scales:
- Unidimensional: all items measure one single construct (no sub-dimensions).
- Multidimensional: items represent distinct domains within a broader construct (sub-dimensions or factors).
- Examples of constructs and scales:
- Academic major satisfaction: unidimensional scale example; typically uses a small number of items (e.g., 6) to yield a single composite score.
- Meaning of life: multidimensional scale example with two dimensions:
- Presence of meaning (items 1, 4, 5, 6, 9; with item 9 reverse-coded)
- Search for meaning (items 2, 3, 7, 8, 10)
- Can compute total meaning-of-life score (sum of all 10 items) or separate scores for each dimension, or both.
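- Scoring sketch (SPSS syntax): a minimal, hedged example of reverse-coding and subscale scoring for this kind of two-dimension scale. The item variable names m1–m10, the reversed-item name m9r, the subscale names, and the assumed 1–7 response range are all illustrative assumptions, not taken from the original instrument; check the scale's own scoring instructions before using anything like this.
  * Reverse-code item 9 assuming a 1-7 response range, so the reversed score is 8 minus the raw score.
  COMPUTE m9r = 8 - m9.
  * Presence-of-meaning subscale: items 1, 4, 5, 6 plus reverse-coded item 9.
  COMPUTE presence = m1 + m4 + m5 + m6 + m9r.
  * Search-for-meaning subscale: items 2, 3, 7, 8, 10.
  COMPUTE search = m2 + m3 + m7 + m8 + m10.
  * Optional overall score as the sum of the two subscale scores.
  COMPUTE meaningtotal = presence + search.
  EXECUTE.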
- Practical characteristics of scales:
- Scales can be short or long (e.g., 40–50 items originally, later shortened to ~7 items).
- Longer surveys increase participant fatigue and missing data risks; shorter scales with the same information are preferred when valid.
- Regardless of scale type, you typically create a composite score by aggregating item responses, not analyzing items individually.
- Composite scores: why and how
- Composite score = a total or a mean score across items measuring the construct.
- Example: Life satisfaction scale with 5 items (range per item: 1–6) -> total score range: 5–30; mean score range: 1–6.
- Composite score types:
- Sum (total) score: S = \sum_{i=1}^{k} x_i = x_1 + x_2 + \dots + x_k
- Mean (average) score: \bar{X} = \frac{1}{k} \sum_{i=1}^{k} x_i (a short worked example follows this list)
- Guidelines:
- Follow the original instrument’s scoring method when possible (sum or mean).
- Use the method used by the instrument’s authors to maintain comparability with prior research.
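- Worked example of the two scoring options, using the five item responses 3, 6, 5, 5, 3 (the same responses that appear in the checking step further below):
  S = 3 + 6 + 5 + 5 + 3 = 22, \qquad \bar{X} = \frac{22}{5} = 4.4
  Both values fall in the documented ranges (5–30 for the sum, 1–6 for the mean).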
- How to compute composite scores (practical SPSS overview):
- Use SPSS Compute Variable to create a composite score via a numeric expression.
- Example: Sum of five life-satisfaction items:
- Expression: x1 + x2 + x3 + x4 + x5
- The result is a new variable (column), e.g., lifesatisfactiontotal, added at the end of the dataset.
- Rename and reposition the new column to align with the original item columns for easy checking.
- Double-check calculations by verifying that item sums match the new total (e.g., 3 + 6 + 5 + 5 + 3 = 22).
- Alternatively, create a mean score: mean(x1, x2, x3, x4, x5) to yield a final score in the 1–6 range (if items are 1–6).
- Always ensure the calculated composite makes sense and aligns with the original scoring method.
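- Equivalent SPSS syntax for the steps above (a sketch: the item names x1–x5 and the total name lifesatisfactiontotal follow the example in this section, while lifesatisfactionmean is an illustrative name for the mean version):
  * Sum score across the five life-satisfaction items (range 5-30 when items run 1-6).
  COMPUTE lifesatisfactiontotal = x1 + x2 + x3 + x4 + x5.
  * Mean score on the original 1-6 metric.
  COMPUTE lifesatisfactionmean = MEAN(x1, x2, x3, x4, x5).
  EXECUTE.
- Note on missing data: the plus-sign sum returns a missing total for any case with a missing item, whereas MEAN() averages whatever valid responses are present, so the two approaches treat skipped items differently.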
- Practical notes on data handling:
- After creating the composite, save the dataset and then proceed to final analyses on the composite score rather than the raw items.
- The process reduces complexity and improves reliability for downstream analyses.
- Fatigue and missing data remain important concerns; shorter, well-validated scales help mitigate this.
- Transition to data quality: assess whether the data are acceptable for analysis
- Once you have a complete dataset with composite scores, evaluate distribution shapes to see if they comply with assumptions of subsequent analyses.
- The shape of the distribution (normality) is central to many statistical tests (parametric tests assume normality).
- Normal distribution and the Central Limit Theorem (CLT)
- CLT intuition: if you repeatedly draw random samples from a population and compute their means, the distribution of those sample means approaches a normal distribution as the sample size grows (not merely as you draw more samples).
- Formal statement (conceptual): the distribution of sample means tends toward normality regardless of the population distribution as sample size increases.
- In practice: with sample sizes > ~30, the CLT tends to hold well, supporting parametric analyses.
- Important caveat: CLT does not absolve you from checking the distribution of your data; outliers and non-normality can still affect analyses, especially with smaller samples or non-parametric contexts.
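- In standard notation (a conceptual restatement of the above, where \mu is the population mean, \sigma^2 a finite population variance, and n the sample size), the sample mean of independent observations is approximately normally distributed once n is large:
  \bar{X}_n \approx N\!\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{for large } n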
- Outliers and their impact on normality and inference
- Outliers are individual scores that lie far from the rest and can distort the normal shape of the distribution.
- Effects of outliers:
- Inflate the standard error and widen confidence intervals around a parameter estimate (e.g., the mean).
- Potentially bias significance tests and their conclusions.
- Common sources of outliers:
- Belonging to a different group than the rest of the sample (real but different population),
- Poor respondent (inattention, random answering),
- Data entry errors,
- Instrument glitches or survey issues (technical problems in data collection).
- Identifying outliers:
- Graphical: Explore -> Plots -> Histogram to visually inspect distributions and spot extreme values (e.g., life satisfaction total with a cluster of very low scores).
- Statistical: Save standardized values as variables to obtain z-scores for each observation.
- Rule of thumb: about 95% of the data should fall within -1.96 \le z \le 1.96.
- More extreme cutoffs include |z| > 2.58 and |z| > 3.29.
- In the example, 14 observations exceeded ±1.96, but none exceeded ±2.58; by this criterion the dataset had no official outliers.
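- Corresponding SPSS syntax (a sketch: the composite name follows the life-satisfaction example, Zlifesatisfactiontotal is the Z-prefixed variable SPSS typically creates when standardized values are saved, and zflag196 is an illustrative helper variable):
  * Histogram and descriptives for the raw distribution.
  EXAMINE VARIABLES=lifesatisfactiontotal /PLOT HISTOGRAM /STATISTICS DESCRIPTIVES.
  * Save standardized (z) values as a new variable.
  DESCRIPTIVES VARIABLES=lifesatisfactiontotal /SAVE.
  * Flag cases beyond the 1.96 screening threshold (1 = flagged, 0 = not).
  COMPUTE zflag196 = ABS(Zlifesatisfactiontotal) > 1.96.
  EXECUTE.
  * Count how many cases exceed the threshold.
  FREQUENCIES VARIABLES=zflag196.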
- Handling outliers (common approach):
- If an observation exceeds ±3.29, consider adjusting the outlier score instead of deleting it outright.
- A typical method: set the outlier value to one unit higher than the next highest non-outlier value. For example, if 48 is an outlier and the next highest non-outlier is 25, adjust the outlier to 26.
- Rationale: preserves sample size and normality for analysis while recognizing the data point may reflect measurement issues or extreme cases.
- Cautions: this imputation can distort the true data-generating process if the outlier is a legitimate extreme case; consider the context and justifications before deciding.
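- If adjustment is chosen, a minimal SPSS-syntax sketch of the replacement in the worked example above, using a hypothetical composite variable called scoretotal (substitute your own variable name and the values actually observed in your data), and document the change:
  * Replace the outlying value 48 with 26, one unit above the next highest score of 25.
  RECODE scoretotal (48 = 26).
  EXECUTE.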
- Practical decision-making about outliers:
- Investigate the number and patterns of outliers before removing them.
- If they reflect systematic issues, address root causes (data collection, instrument problems).
- If they reflect real variation, consider robust statistics or non-parametric analyses as alternatives.
- SPSS practical workflow for outliers and distribution checks (warm-up overview):
- Use Explore -> Plots -> Histogram to view raw data distribution for a variable (e.g., life satisfaction total).
- Use Explore -> Save -> Standardized values as variables to obtain z-scores for the data and identify potential outliers.
- Interpret z-score thresholds in the context of your data and the study design.
- Decide on a plan for handling outliers (investigate, impute, transform, or exclude) and document the rationale.
- Summary of the workflow and key takeaways
- Begin with a solid understanding of what you are measuring: latent constructs require scales and composite scoring.
- Prefer validated scales; follow original scoring instructions (sum vs mean; reverse-coded items).
- Create composite scores to represent constructs and align with prior literature for comparability.
- Check data quality: normality, outliers, and distribution shapes, guided by the central limit theorem and diagnostic plots.
- Use SPSS (Compute Variable, Explore, Plots) to implement scoring and outlier checks; document every step for transparency and reproducibility.
- Final practical reminder:
- Double-check all composite calculations by verifying that the total or mean aligns with the underlying item scores.
- Consider the trade-offs between data integrity and statistical convenience when deciding how to handle outliers or non-normal data.
- The approaches discussed here (composite scoring, normality assessment, and outlier management) will recur in later units as you advance in data analysis.
Key formulas and concepts
- Composite score (sum): S = \sum_{i=1}^{k} x_i
- Composite score (mean): \bar{X} = \frac{1}{k} \sum_{i=1}^{k} x_i
- Z-score (sample): z_i = \frac{x_i - \bar{X}}{s}
- Normal distribution and CLT (conceptual): the distribution of sample means tends toward a normal distribution as the sample size grows; with large samples (n > ~30) the approximation is typically robust.
- Outlier thresholds (typical references):
- Within ±1.96: about 95% of normally distributed data
- Beyond ±2.58: outside the central ~99%; extreme, but not necessarily treated as outliers
- Beyond ±3.29: outside the central ~99.9%; typically treated as outliers
- Data handling goals:
- Create usable, comparably scored composites
- Preserve data integrity while ensuring normality assumptions are reasonable for subsequent analyses
- Document all decisions about scoring and outlier treatment for transparency