
Part 5

Data transformation and the purpose of transformations

  • Transformation of data refers to mathematically transforming all observed scores in a dataset to produce a different scale of measurement with the goal of improving statistical properties for parametric analyses.

  • Primary aims include:

    • improving normality in skewed distributions
    • reducing heterogeneity of variances across groups (homogeneity of variances)
    • reducing the influence of outliers
    • improving linearity to satisfy assumptions of parametric tests
  • Important caveats:

    • Do not transform willy-nilly; apply only for solid theoretical or statistical reasons
    • Psychology data rarely follow perfect normal distributions; parametric tests (e.g., ANOVA) are often robust to deviations from normality
    • Always assess whether robustness is sufficient before transforming; transformation should be a last resort
  • Key idea: a transformation changes the values of all data points but preserves their order; it changes distances but not the ranking

  • After transformation, the same data points may be more normally distributed or have more similar variances, which can influence test statistics favorably or unfavorably depending on the context
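As a minimal sketch of the order-preservation idea (with made-up scores, not course data), a square-root transformation keeps the rank order of every data point while compressing the distances in the right tail:

```python
import math

# Hypothetical positively skewed scores (illustrative values only)
x = [1.0, 4.0, 9.0, 25.0, 100.0]
x_sqrt = [math.sqrt(v) for v in x]  # [1.0, 2.0, 3.0, 5.0, 10.0]

# Ranking is preserved: the sort order is identical before and after
order_raw = sorted(range(len(x)), key=x.__getitem__)
order_new = sorted(range(len(x_sqrt)), key=x_sqrt.__getitem__)
assert order_raw == order_new

# Distances are not preserved: the gap between the two largest scores
# shrinks from 75 raw units to 5 transformed units, pulling in the tail
print(x[-1] - x[-2], x_sqrt[-1] - x_sqrt[-2])  # 75.0 5.0
```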

  • In regression vs ANOVA, the scope of transformation differs:

    • Regression: transform only the problematic variable(s) that deviate from normality
    • ANOVA: transform all groups/variables involved in the comparison
  • Example transformations: square root and log transformations

    • Square root transformation for moderately positive skew:

    • For each data point x, apply: \tilde{x} = \sqrt{x}

    • In practice (SPSS): apply the transformation to every value in the dataset

    • After transformation, compare distributions again using histograms and normality tests (e.g., Shapiro-Wilk, Kolmogorov-Smirnov) and QQ plots

    • If distribution improves, you may retain the transformation; if not, revert to raw data

  • Severe positive skew may warrant a logarithmic transformation:

    • Use base 10: \tilde{x} = \log_{10}(x)

    • Important constraint: zeros cannot be log-transformed; all values must be > 0
    • Compare with square root: sometimes log is more effective, sometimes sqrt is better; you may try multiple transformations to see which yields better normality
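To see which transformation helps more, one can compare skewness before and after. The sketch below uses made-up values and a simple moment-based skewness coefficient (in practice you would inspect histograms and run Shapiro-Wilk, as the notes describe):

```python
import math

def skewness(data):
    # Simple moment-based skewness; positive values indicate a right tail
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in data) / n)
    return sum(((v - mean) / sd) ** 3 for v in data) / n

# Hypothetical scores with strong positive skew (illustrative values only)
raw = [1, 2, 2, 3, 3, 4, 5, 8, 15, 40]
sqrt_t = [math.sqrt(v) for v in raw]
log_t = [math.log10(v) for v in raw]  # requires all values > 0

for label, data in [("raw", raw), ("sqrt", sqrt_t), ("log10", log_t)]:
    print(f"{label:>5}: skewness = {skewness(data):.2f}")
# With this severe skew the log transform reduces skewness the most;
# with milder skew, the square root can be the better choice
```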
  • What to expect after transformation:

    • Normality tests may become non-significant (e.g., Shapiro-Wilk) if normality is improved
    • QQ plots should look closer to the diagonal line
    • A log transformation turns the back-transformed mean into a geometric rather than an arithmetic mean, which effectively moves interval/ratio-scale data toward an ordinal-like interpretation in some cases
  • Important conceptual caveat about data scale after transformation:

    • Transformations can convert data from ratio/interval to ordinal (or at least alter the metric interpretation)
    • The geometric mean (used with transformed data) can be a weaker measure of central tendency than the arithmetic mean on raw data
    • Consequently, hypotheses built on transformed data may need reinterpretation
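The geometric-mean point can be checked directly: averaging on the log scale and back-transforming yields the geometric mean, not the arithmetic one (illustrative numbers, not course data):

```python
import math

# Hypothetical positive scores (illustrative values only)
x = [2.0, 8.0, 32.0]

arith = sum(x) / len(x)  # arithmetic mean of the raw data: 14.0

# Mean computed on the log10 scale, then back-transformed
log_mean = sum(math.log10(v) for v in x) / len(x)
back = 10 ** log_mean

# The back-transformed mean equals the geometric mean (8.0 here),
# which down-weights the largest values
geom = math.prod(x) ** (1 / len(x))
assert math.isclose(back, geom)
print(arith, round(back, 4))  # 14.0 8.0
```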
  • Practical guidance on transformations:

    • Use transformations that are widely used and justified in the literature
    • Avoid using many different transformations on the same data set when comparing means, as interpretation becomes fragile
    • If transformations do not improve the data in a meaningful way, keep the raw data
  • Ethical/philosophical note:

    • Measurement in psychology is often ordinal in nature; transforming data can push it toward ordinal interpretation, which has implications for what the statistics actually mean
    • Be transparent about the transformation and the implications for hypothesis testing

Reliability and the concept of reliability in scales

  • Reliability in quantitative measurement is about freedom from error; observed scores include true scores plus measurement error
  • Classical view (short form):
    • Observed score: X = T + E where T is the true score and E is error
    • The goal of reliability is to maximize the proportion of observed variance due to true variance: reliability increases as error variance decreases
    • A common formal expression (in classical test theory) is:
      \rho_{XX'} = \frac{\mathrm{Cov}(X, X')}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(X')}} for parallel forms, or more generally reliability equals the ratio of true-score variance to observed-score variance: \text{Reliability} = \frac{\mathrm{Var}(T)}{\mathrm{Var}(X)}
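A small simulation (with assumed variances, not real data) illustrates the classical model: generating T and E with Var(T) = 225 and Var(E) = 25 gives a theoretical reliability of 225/250 = 0.90, and the sample ratio lands close to that:

```python
import random
import statistics

random.seed(42)
n = 10_000

# Classical test theory: observed score X = true score T + error E
true = [random.gauss(100, 15) for _ in range(n)]   # Var(T) = 15^2 = 225
error = [random.gauss(0, 5) for _ in range(n)]     # Var(E) = 5^2  = 25
observed = [t + e for t, e in zip(true, error)]

# Reliability = Var(T) / Var(X); with these assumed variances the
# theoretical value is 225 / (225 + 25) = 0.90
rel = statistics.variance(true) / statistics.variance(observed)
print(round(rel, 2))  # close to 0.90
```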
  • Causes of measurement error span several sources:
    • Test construction: domain sampling, item quality, item wording and grammar
    • Test administration: participant factors (attention, motivation, fatigue, anxiety); test environment (lighting, temperature, noise); administrator effects
    • Scoring: scoring rules, coder judgments, data entry mistakes
    • Systematic error: bias in questions (impression management, response biases, social desirability)
  • The reliability of a scale is assessed through four main approaches (two covered in this section; two discussed on the next slide):
    1) Alternative/parallel forms reliability: correlation between two different forms measuring the same construct
      • Example: two versions of a scale measuring “liking for elephants”; compute Pearson r between scores from form A and form B
    2) Test-retest reliability: correlation between scores on the same measure across two time points
      • Example: measure participants now and then again later; correlate the two sets of scores to assess temporal stability
    3) Internal reliability (consistency of items within a single scale):
      • Split-half reliability: correlate scores from two halves of the same test
      • Cronbach's alpha: a more robust internal-consistency index derived from all possible split-halves
    4) Inter-rater reliability: correlation between scores assigned by different raters
      • Example: two or more raters score the same responses; higher correlations indicate greater agreement
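For the correlation-based indices above (parallel forms, test-retest, inter-rater), the computation is a plain Pearson r. A sketch with hypothetical test-retest scores (illustrative numbers, not course data):

```python
import math

def pearson_r(a, b):
    # Pearson correlation: covariance divided by the product of SDs
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a)
                           * sum((y - mb) ** 2 for y in b))

# Hypothetical scores from the same scale at two time points
time1 = [12, 15, 9, 20, 17, 11, 14, 18]
time2 = [13, 14, 10, 19, 18, 10, 15, 17]

r = pearson_r(time1, time2)
print(round(r, 2))  # 0.96 -> high temporal stability
```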
  • The general rule of thumb for reliability coefficients (Pearson correlations, Cronbach’s alpha, etc.):
    • Values range from 0 to 1; higher is better
    • Common guidelines:
      • > 0.9: excellent
      • 0.8–0.9: good
      • 0.7–0.8: acceptable
      • 0.6–0.7: questionable
      • 0.5–0.6: poor
      • < 0.5: unacceptable
  • In practice, a reliability coefficient above 0.9 is very strong but can indicate item redundancy; interpret with caution
  • Reliability is context-dependent; there is no single universal cutoff that fits all situations
  • When reliability is insufficient, consider revising items, improving administration procedures, or using different measurement tools

Cronbach's alpha and internal reliability in practice

  • Cronbach's alpha is a measure of internal consistency for a set of items intended to measure the same construct
    • Conceptually, higher alpha suggests items are more homogeneous in measuring the same underlying construct
    • Two common formulas (equivalent forms):
    • \text{Alpha} = \frac{N\bar{c}}{\bar{v} + (N-1)\bar{c}}
      where N = number of items, \bar{c} = average covariance between item pairs, \bar{v} = average variance of items
    • \text{Alpha} = \frac{N}{N-1}\left(1 - \frac{\sum \text{item variances}}{\text{total variance}}\right)
      (alternative standard form involving item variances and the variance of the total score; essentially, alpha increases with more items and with greater average inter-item covariance relative to total variance)
  • SPSS implementation (practical steps):
    • Analyze -> Scale -> Reliability Analysis
    • Select the items (e.g., four or more questions) to form a scale; label the scale (e.g., Support scale)
    • Options to request: Cronbach’s Alpha, scale if item is deleted, descriptives, inter-item correlations
    • Output typically includes:
    • Reliability Statistics: Cronbach's Alpha value (e.g., 0.918)
    • Inter-Item Correlation Matrix: pairwise correlations between items; should not have extremely high correlations between all items (e.g., > 0.8) which could indicate redundancy
    • Item-Total Statistics: correlations of each item with the total score (Corrected Item-Total Correlation) and Alpha if Item Deleted
    • If you check "Scale if item is deleted", you get a table showing how Cronbach's Alpha would change if each item were removed; useful to identify problematic items
  • Example interpretation from the transcript:
    • Cronbach's Alpha reported as 0.918 for four items (excellent internal reliability)
    • Inter-item correlations are all below 0.8, suggesting items are not redundant
    • Alpha if Item Deleted table helps assess whether removing any item would improve reliability; in practice, if removing an item raises alpha significantly, that item may be problematic
    • The output also includes an item-by-item table showing how correlations with the total score change if an item is deleted
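The "Alpha if Item Deleted" logic can be mimicked outside SPSS by recomputing alpha with each item left out. In this hypothetical example item 4 is deliberately weakly related to the others, so dropping it raises alpha:

```python
import statistics

def cronbach_alpha(items):
    # Variance form: alpha = N/(N-1) * (1 - sum item vars / var of total)
    n = len(items)
    total = [sum(scores) for scores in zip(*items)]
    item_vars = sum(statistics.variance(col) for col in items)
    return n / (n - 1) * (1 - item_vars / statistics.variance(total))

# Hypothetical 4-item scale (one list of scores per item)
items = [
    [3, 4, 2, 5, 4, 3],
    [2, 4, 2, 5, 5, 3],
    [3, 5, 1, 4, 4, 2],
    [4, 3, 2, 3, 5, 2],   # the "problem" item, weakly related to the rest
]

print("alpha, all items:", round(cronbach_alpha(items), 3))
for i in range(len(items)):
    rest = items[:i] + items[i + 1:]
    print(f"alpha if item {i + 1} deleted:", round(cronbach_alpha(rest), 3))
# Deleting item 4 is the only removal that raises alpha, flagging it for review
```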
  • Important caveat about high alpha:
    • A very high alpha (close to 1) can occur if items are very similar in wording or content; this may inflate alpha without truly broadening the construct being measured
    • Always review item content to ensure they capture different facets of the same construct and are not merely redundant
  • Inter-rater reliability (additional detail):
    • When multiple raters are involved, compute correlations between raters’ scores
    • An r around 0.8 or higher is generally considered acceptable; higher is better
    • In practice, ensure that instructions for raters are standardized to reduce systematic differences in scoring

Practical considerations and common pitfalls when using reliability analyses

  • Key takeaways:
    • Reliability is about consistency and error reduction, not about absolute accuracy
    • Different reliability indices answer different questions (e.g., temporal stability vs. internal consistency vs. inter-rater agreement)
    • Always interpret reliability alongside the validity of the measure; a reliable measure that is not valid is not useful
  • Common cautions:
    • High internal consistency does not guarantee that the scale measures a single construct; it may just reflect redundancy among items
    • If your data contain mixed constructs, consider using multi-dimensional reliability assessments or separate scales for each construct
    • When reporting Cronbach's alpha, also report the number of items, sample size, and the context of the measurement for clarity
    • Be mindful of the measurement scale: transforming data can alter the interpretation of reliability metrics; clearly state whether reliability was assessed on the original or transformed data

Quick recap and how this ties into the broader course content

  • Transformation of data helps meet assumptions for parametric statistics (normality, equal variances, linearity), but should be used judiciously and with a clear theoretical justification
  • Reliability of scales is essential for ensuring consistent, repeatable measurement of constructs in quantitative psychology research
  • Four main reliability types are covered: parallel forms, test-retest, internal reliability (split-half and Cronbach's alpha), and inter-rater reliability
  • SPSS provides practical tools to compute Cronbach's alpha, inspect item-total statistics, and evaluate whether removing items improves reliability
  • The week-one module ties these concepts to subsequent topics like analysis of variance (ANOVA) and regression, and emphasizes robustness and ethical data interpretation in psychological measurement

References to formulas and key terms used in this note

  • Transformation formulas:

    • Square root transformation: \tilde{x} = \sqrt{x}
    • Log transformation (base 10): \tilde{x} = \log_{10}(x)

  • Distance vs order preservation: transformations change distances but preserve order

  • True score and error (classical test theory): X = T + E,
    \text{Reliability} = \frac{\mathrm{Var}(T)}{\mathrm{Var}(X)}

  • Cronbach's Alpha (two common forms):

    • \text{Alpha} = \frac{N\bar{c}}{\bar{v} + (N-1)\bar{c}}
    • Alternative view: \text{Alpha} = \frac{N}{N-1}\left(1 - \frac{\text{sum of item variances}}{\text{total variance}}\right)
  • Inter-item correlation matrix and item-total statistics are produced in SPSS outputs to diagnose reliability