Notes: Data Cleaning and Preparation for Quantitative Analysis (Week 1)

Course Context and Assessments

  • Overview: 13-week trimester with three assessment items (two practical assignments, one quantitative and one qualitative, plus an in-class exam). The quantitative strand is emphasized across weeks 1–4, covering mediation and moderation regression and path analysis, followed by ANOVA/multivariate approaches and a later factor analysis/meta-analysis discussion. The in-class exam (week 13) is invigilated in class with cameras on.

  • Assessments breakdown: Quantitative assignment due around week 8; Qualitative assignment due around week 12; final exam combines material from both quantitative and qualitative strands. The aim is to separate the workload to avoid end-of-trimester pile-up and to provide structured preparation.

  • Exam prep: The instructor will provide topics and practice questions to guide study; expects to give a solid heads-up on topics and a practice exam package.

  • Readings and field readings: Readings labeled with dates (e.g., 02/2017, 2013, 2024) are not prescribed rigidly; students can pick any relevant field reading, textbook, or online resource (YouTube, etc.). The course emphasizes choosing materials that help with moderated regression analysis and path analysis rather than sticking strictly to one textbook.

  • Textbooks and resources: Andy Field is a good reference, though the class will go into more depth in certain areas; access to a newer text can be beneficial.

  • Software choices: SPSS is recommended because the macro outputs used in this course (e.g., the PROCESS macro for path analysis) are configured specifically for it. Jamovi is a freeware alternative but less supported here; the tutor notes potential future support for Jamovi but currently emphasizes SPSS for consistency.

  • Balance of quantitative and qualitative work: There is an intentional balance (roughly six weeks quantitative and six weeks qualitative) to ensure coverage of both strands. Marley (the qualitative teacher) begins around week 5.

  • Data focus and relevance: The class uses a live data preparation exercise (investigating the role of future orientation in career agency and perceived employability in an Australian adult population) with three scales: Future Orientation (FO), Career Strategies (CS), and Perceived Employability (PE).

  • Data collection context: Data collection typically uses Qualtrics for online surveys; participants complete multiple scales (FO: 13 items; CS: 26 items; PE: 16 items). Each scale has its own response range (FO: 1–5; CS: 1–6; PE: 1–7).

  • Data preparation goal: Turn raw Qualtrics export into a clean SPSS-ready dataset with id, cleaned variables, scale totals, and checked assumptions for subsequent analyses.

Data sources and scales

  • FO (Future Orientation): 13 items from Zimbardo and Boyd’s Time Perspective Inventory; scale range typically 1–5 per item.

  • CS (Career Strategies): 26 items covering seven career development strategies; scale range typically 1–6 per item.

  • PE (Perceived Employability): 16 items; scale range typically 1–7 per item.

  • ID and demographics: Collects consent, age, gender, education level, place of study, etc.; in this workflow, consent and age are particularly important for inclusion criteria.

  • Important note on data structure: Each row is a participant; columns are items within FO, CS, PE, plus demographics. The initial dataset contains many Qualtrics-provided metadata fields (e.g., dates, IPs) that should be deleted before analysis.

Data preparation steps (workflow overview)

  • Create an identification column (ID) from the SPSS case number to track participants across edits

  • Convert the ID column to nominal in Variable View and place it at the start of the dataset

  • Delete non-essential Qualtrics metadata columns (e.g., IP address, timestamps, etc.) unless needed for reporting

  • Handle consent and age: exclude participants without consent or under the age threshold (18 in this example) based on ethics and project scope

  • Remove non-responders: identify and delete participants who did not provide any data or did not complete scales

  • Decide on hiding vs deleting: some prefer hiding unused columns to preserve the original raw data; others delete for simplicity

  • Record deletions: document how many participants were removed and why (for thesis/reporting)

  • Validate remaining data: use descriptive statistics to confirm data integrity (e.g., number of remaining participants, distribution of key variables)
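
The overview above can be sketched as plain code. The snippet below is a minimal pure-Python illustration of the filtering logic, using hypothetical field names (consent, age, FO1, FO2); in class the same steps are carried out through the SPSS menus, not code.

```python
# Illustrative sketch of the cleaning workflow (hypothetical field names).
# In class this is done through the SPSS menus instead.

raw = [
    {"consent": 1, "age": 25, "FO1": 4, "FO2": 3},        # complete responder
    {"consent": 0, "age": 30, "FO1": 5, "FO2": 2},        # no consent -> exclude
    {"consent": 1, "age": 16, "FO1": 3, "FO2": 4},        # under 18 -> exclude
    {"consent": 1, "age": 40, "FO1": None, "FO2": None},  # non-responder -> exclude
]

def is_responder(row):
    """Keep participants who answered at least one scale item."""
    items = [v for k, v in row.items() if k.startswith("FO")]
    return any(v is not None for v in items)

cleaned = [r for r in raw
           if r["consent"] == 1 and r["age"] >= 18 and is_responder(r)]

# Assign an ID after cleaning so every retained case stays traceable
for i, row in enumerate(cleaned, start=1):
    row["ID"] = i

removed = len(raw) - len(cleaned)  # document this count for reporting
print(len(cleaned), removed)       # 1 participant kept, 3 removed
```

The deletion count (`removed`) is exactly the figure the notes say to record for thesis/reporting purposes.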

Detailed workflow steps (data cleaning and preparation)

  • Step 1: Create an ID column

    • In SPSS: Transform → Compute Variable

    • The SPSS case number is the default source for a unique identifier; compute a new variable (ID) from it

    • Example: ID = $CASENUM (the built-in case-number system variable)

    • Verify with Frequencies or Descriptives: e.g., total cases = 304

  • Step 2: Reorder the ID column and set its measure to Nominal

    • In Variable View: change the ID variable’s Measure to Nominal

    • Optional: insert a new variable at the front for cleaner layout

  • Step 3: Remove rubbish columns (Qualtrics metadata)

    • Identify columns that are not needed for analysis (e.g., raw dates, timestamps, IP, etc.)

    • Delete those columns or hide them if desired; keep a raw copy for reference

  • Step 4: Determine inclusion/exclusion criteria for the dataset

    • Exclude participants who did not provide consent

    • Exclude underage participants (e.g., age < 18)

    • Exclude non-responders (participants with no data across scales)

    • Document the rationale for exclusions (e.g., ethical/reporting requirements)

  • Step 5: Inspect missing data and participant response patterns

    • Use Analyze → Descriptive Statistics → Frequencies to inspect the consent and age fields and identify non-responders

    • Use Descriptives to check for missing values across scales

  • Step 6: Manage missing data and partial responses

    • Apply a rule of thumb: exclude cases with substantial missing data per scale (e.g., more than one-third of items missing within a scale)

    • For the assignment dataset, imputation procedures will be introduced (EM method)

  • Step 7: Handle potential data entry issues (e.g., age entered as birth year)

    • If an age was entered as a birth year (e.g., 1976), convert it to an age (e.g., 49) and document the justification for the change

  • Step 8: Check for overall data integrity before imputation

    • Use Descriptives and frequency checks on each item to ensure scales are reasonable (1–5, 1–6, 1–7 as appropriate)

  • Step 9: Prepare for missing data analysis and imputation

    • If missing data exist, plan to apply EM-based imputation to create a complete dataset for computing scale totals
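
Step 6's one-third rule can be made concrete in code. A minimal sketch with hypothetical data and a hypothetical helper name (in practice this judgement is applied to SPSS output by eye):

```python
# Sketch of the "more than one-third missing within a scale" rule of thumb
# (hypothetical data; not an SPSS procedure).

FO_ITEMS = [f"FO{i}" for i in range(1, 14)]  # the 13 FO items

def too_much_missing(row, item_names, threshold=1 / 3):
    """Flag a case whose missing proportion on a scale exceeds the threshold."""
    missing = sum(1 for name in item_names if row.get(name) is None)
    return missing / len(item_names) > threshold

mostly_complete = {f"FO{i}": 3 for i in range(1, 14)}
mostly_complete["FO13"] = None                     # 1/13 missing -> keep

half_missing = {f"FO{i}": 3 for i in range(1, 8)}  # FO8-FO13 absent -> 6/13 missing

print(too_much_missing(mostly_complete, FO_ITEMS))  # False
print(too_much_missing(half_missing, FO_ITEMS))     # True
```

Cases flagged `True` are candidates for exclusion rather than imputation, matching the caveat that imputing substantial missingness can bias results.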

Handling missing data (MCAR, EM, and reporting)

  • Missing data concepts

    • MCAR: Missing Completely at Random (Little’s test)

    • MAR: Missing at Random

    • MNAR: Missing Not at Random

  • Little’s MCAR test

    • Null hypothesis: data are MCAR

    • If p > 0.05, data are not significantly different from MCAR; proceed with imputation if needed

    • If p < 0.05, data are not MCAR; investigate reasons or adjust approach

  • Practical steps in SPSS (EM imputation)

    • Analyze → Missing Value Analysis; select EM (Expectation-Maximization) as the estimation method

    • Select the variables to include in the missing data analysis (scale items and relevant demographics; generally exclude consent since it is a pass/fail inclusion criterion)

    • Run EM to create a new dataset with imputed values (SPSS will generate a new dataset, e.g., “untitled new”)

    • Use the imputed dataset for subsequent analyses to compute scale totals and run regressions

  • Important caveats

    • EM assumes missing data are MAR or MCAR; interpretation and reporting should acknowledge limitations

    • For extensive missing data within a single scale (e.g., >33%), reconsider including that case in the analysis or imputation, as imputing substantial missingness can bias results

  • Practical reporting guidance

    • In a thesis/report, document: whether MCAR was satisfied, the extent of missing data, and the imputation method used (EM)

    • Note that imputation creates a complete dataset but should be reported as a methodological choice with justification
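
SPSS's EM routine estimates missing values from the multivariate structure of the data; a full EM implementation is beyond these notes. Purely to illustrate what "creating a complete dataset" means, the sketch below uses person-mean imputation (filling a missing item with the mean of that participant's answered items on the same scale). This is a much simpler stand-in, not EM; report whichever method you actually applied.

```python
# Person-mean imputation: a simple stand-in used here only to show the
# mechanics of producing a complete dataset. It is NOT the EM algorithm.

def person_mean_impute(row, item_names):
    """Replace each missing item with the mean of the person's answered items."""
    answered = [row[name] for name in item_names if row[name] is not None]
    if not answered:
        raise ValueError("No answered items; exclude this case instead.")
    fill = sum(answered) / len(answered)
    return {name: (row[name] if row[name] is not None else fill)
            for name in item_names}

items = ["FO1", "FO2", "FO3", "FO4"]
row = {"FO1": 4, "FO2": None, "FO3": 2, "FO4": 3}

complete = person_mean_impute(row, items)
print(complete["FO2"])  # 3.0 (mean of 4, 2, 3)
```

Whatever method is used, the reporting requirement is the same: state the extent of missingness, the MCAR result, and the imputation method chosen.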

Creating scale totals (FO, CS, PE)

  • Rationale: Scales are typically multi-item constructs; total scores are used for subsequent analyses

  • FO total score

    • FO_{total} = \sum_{i=1}^{13} FO_i

  • CS total score

    • CS_{total} = \sum_{i=1}^{26} CS_i

  • PE total score

    • PE_{total} = \sum_{i=1}^{16} PE_i

  • Practical note: After imputation, compute these totals on the imputed dataset; verify correctness by hand for a few cases as a sanity check

  • Example workflow in SPSS

    • Transform → Compute Variable → name new variable (e.g., FO_total)

    • Use the SUM function: FO_total = FO1 + FO2 + … + FO13, or the short form FO_total = SUM(FO1 TO FO13)

  • Post-check

    • Validate a few cases by hand to ensure totals align with the item values (e.g., for one participant, sum FO1 through FO13 and confirm FO_total)
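
The total-and-hand-check step can be sketched as code. The snippet below mirrors SPSS's SUM(FO1 TO FO13) in pure Python with hypothetical item values, so the hand verification is easy to see:

```python
# Computing a scale total and hand-verifying one case (hypothetical values);
# assumes missing data were already handled on the imputed dataset.

def scale_total(row, item_names):
    """Sum the named items for one participant."""
    return sum(row[name] for name in item_names)

FO_ITEMS = [f"FO{i}" for i in range(1, 14)]
participant = {f"FO{i}": 3 for i in range(1, 14)}  # all 13 items answered "3"

fo_total = scale_total(participant, FO_ITEMS)
print(fo_total)  # 39: the hand check (13 items x 3) matches
```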

Reverse coding (FO items 11–13 example)

  • Why reverse coding matters: Some scales include negatively worded items that must be reverse-scored so that higher scores consistently reflect higher levels on the construct

  • Example from FO: items FO11, FO12, FO13 are reverse coded

  • Reverse-coding rule (1–5 scale)

    • Old value x → New value x' = 6 - x

    • So mapping: 1 → 5, 2 → 4, 3 → 3, 4 → 2, 5 → 1

  • SPSS procedure (recoding into the same variable)

    • Transform → Recode into same variables (or into new variables for safety)

    • Select FO11, FO12, FO13

    • Old and New values: 1 → 5, 2 → 4, 3 → 3, 4 → 2, 5 → 1

    • Apply and verify that the values of FO11, FO12, FO13 have changed as expected

  • Practical note

    • Ensure you follow the authors’ coding scheme; mis-coding can invalidate results (the instructor shared a cautionary anecdote about forgetting to reverse code and the impact on findings)

  • After reverse coding

    • Recompute scale totals for FO, CS, PE with the reversed items properly included
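
The x' = 6 − x mapping is easy to verify in code. A minimal sketch with hypothetical values (the same mapping the SPSS Recode dialog applies):

```python
# Reverse coding on a 1-5 scale: new = (5 + 1) - old, applied to FO11-FO13.
# Hypothetical values; in SPSS this is Transform -> Recode into Same Variables.

REVERSED_ITEMS = ["FO11", "FO12", "FO13"]

def reverse_code(row, item_names, scale_max=5):
    """Return a copy with the named items reverse-scored (x -> max+1-x)."""
    out = dict(row)
    for name in item_names:
        out[name] = (scale_max + 1) - out[name]
    return out

row = {"FO10": 4, "FO11": 1, "FO12": 5, "FO13": 3}
recoded = reverse_code(row, REVERSED_ITEMS)
print(recoded)  # {'FO10': 4, 'FO11': 5, 'FO12': 1, 'FO13': 3}
```

Note that the midpoint (3) maps to itself and non-reversed items (FO10) are untouched; spot-checking a case like this is the verification step the notes recommend.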

Descriptive checks and outlier handling

  • Univariate outliers

    • For a single scale total, identify extreme scores using z-scores

    • Common threshold: |z| > 3.29 corresponds to p < .001 under a normal distribution

    • If a participant has |z| > 3.29 on a single scale total, flag as a potential univariate outlier; investigate context

  • Multivariate outliers

    • Cases with an unusual combination of scores across several scales; a separate multivariate test (typically Mahalanobis distance) is used (not the focus for this week)

    • If identified, consider whether to exclude; this is typically addressed in later weeks

  • Normality checks

    • Use Kolmogorov-Smirnov and Shapiro-Wilk tests, plus visual checks (histograms and QQ plots)

    • With large samples (e.g., N ≈ 300+), regression and ANOVA are fairly robust to modest departures from normality; small deviations are often acceptable

    • If distribution appears non-normal, weigh the option of transformation against preserving data interpretability

  • Linearity checks

    • Use scatterplot matrices (e.g., FO_total vs CS_total, FO_total vs PE_total, CS_total vs PE_total) to assess linear relationships

    • A linear relationship is assumed for regression-based analyses; nonlinear patterns suggest the need for transformation or a different modeling approach

  • Practical reporting

    • Record that outliers were checked, normality and linearity assessed, and decisions made about retaining or excluding cases

    • Include a note in the appendix about the diagnostic checks and their outcomes
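
The |z| > 3.29 rule can be sketched directly. The snippet below (hypothetical scores with one planted extreme value; only Python's statistics module) flags univariate outliers on a scale total:

```python
# Flagging univariate outliers on a scale total with the |z| > 3.29
# threshold (p < .001); hypothetical data with one planted extreme score.

from statistics import mean, stdev

def univariate_outliers(scores, cutoff=3.29):
    """Return (index, z) pairs for scores beyond the |z| cutoff."""
    m, s = mean(scores), stdev(scores)
    return [(i, (x - m) / s) for i, x in enumerate(scores)
            if abs(x - m) / s > cutoff]

scores = [30, 32, 31, 29, 33, 30, 31, 32, 30, 31] * 10 + [80]  # 80 is extreme
flagged = univariate_outliers(scores)
print(flagged)  # only the planted score at index 100 is flagged
```

As the notes say, a flag is a prompt to investigate the case in context, not an automatic deletion.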

Reliability analysis (Cronbach’s alpha)

  • Purpose: Assess internal consistency reliability of a scale (whether items measure the same construct)

  • Cronbach’s alpha formula (general form)

    • \alpha = \frac{N}{N-1} \left(1 - \frac{\sum_{i=1}^{N} \sigma_i^2}{\sigma_T^2}\right)

    • Where N is the number of items, $\sigma_i^2$ is the variance of item i, and $\sigma_T^2$ is the variance of the total score across all items

  • Interpretation guidelines (typical benchmarks)

    • 0.60–0.70: acceptable

    • 0.70–0.80: good

    • 0.80–0.90: very good

    • >0.90: excellent but may indicate item redundancy

  • Example from FO (Future Orientation) scale

    • Cronbach’s alpha ≈ 0.76 (reported as adequate/good)

  • SPSS procedure for reliability

    • Analyze → Scale → Reliability Analysis

    • Select FO items (FO1 to FO13)

    • Run and report Cronbach’s alpha value along with the number of items

  • Practical note

    • Report reliability for each scale separately (FO, CS, PE) and discuss whether any item revisions are warranted in future iterations
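
Cronbach's alpha can also be computed straight from the formula above, which is a useful sanity check on SPSS output. A toy example (illustrative only; Python's statistics module) with three perfectly correlated items, where alpha should be 1.0:

```python
# Cronbach's alpha from its definition:
# alpha = N/(N-1) * (1 - sum(item variances) / variance(total scores)).
# Toy data only; SPSS's Reliability Analysis computes this for you.

from statistics import variance

def cronbach_alpha(item_columns):
    """item_columns: one list of participant scores per item."""
    n_items = len(item_columns)
    item_vars = sum(variance(col) for col in item_columns)
    totals = [sum(scores) for scores in zip(*item_columns)]
    return (n_items / (n_items - 1)) * (1 - item_vars / variance(totals))

# Three perfectly correlated items, four participants -> alpha = 1.0
items = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
print(cronbach_alpha(items))  # 1.0 (up to floating-point rounding)
```

Real scales land well below 1.0 (the FO example above reported ≈ 0.76); values near 1.0 in practice can signal item redundancy, as noted in the benchmarks.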

Practical notes for the SPSS workflow and reporting

  • Raw data preservation

    • Always keep the raw Qualtrics export as a separate file before making edits

  • Documentation of decisions

    • Keep a running log of which participants were excluded and why (consent, age, non-response, etc.)

  • Cross-checks

    • Perform spot-checks by hand to verify totals and reverse coding mappings

  • Version control

    • Save intermediate datasets (raw, cleaned, imputed) with clear filenames to avoid overwriting important data

  • Ethical and reporting considerations

    • Report the extent of exclusions and missing data, and justify imputation choices and any eliminations in your thesis/report

What to expect in subsequent weeks (summary of trajectory)

  • Week 2–3: Path analysis and associated outputs; focus on mediation and moderation regression and the interpretation of path diagrams

  • Week 4: Advanced ANOVA/ multivariate approaches (MANOVA, ANCOVA) and manual derivation of ANOVA (for deeper understanding)

  • Week 5 onward: Marley takes over qualitative content (mixed methods, phenomenological and thematic analyses)

  • Week 11–12: Factor analysis and meta-analysis concepts; understanding scales construction and construct validity

  • Week 13: In-class exam with prepared topics; exam format includes a mix of questions from the lectures and the workshop materials

Quick reference formulas and values

  • Scale totals

    • FO_{total} = \sum_{i=1}^{13} FO_i

    • CS_{total} = \sum_{i=1}^{26} CS_i

    • PE_{total} = \sum_{i=1}^{16} PE_i

  • Reverse coding for a 1–5 scale (example: FO11, FO12, FO13)

    • Old x → New x' where x' = 6 - x

  • Cronbach’s alpha (reliability)

    • \alpha = \frac{N}{N-1} \left(1 - \frac{\sum_{i=1}^{N} \sigma_i^2}{\sigma_T^2}\right)

  • Outlier threshold (univariate)

    • |z| > 3.29 indicates an extreme value (p < 0.001)

  • Missing data handling (conceptual)

    • MCAR: Missing Completely at Random; assessed via Little’s MCAR test

    • EM: Expectation-Maximization imputation to fill in missing values for analyses

Final reminder

  • Tonight’s session focuses on getting a clean, analysable dataset ready for regression-based analyses and for creating scale totals. The practices learned here (data cleaning, missing data handling, reverse coding, total score computation, normality/linearity checks, and reliability analysis) are directly applicable to your assignment and real-world data work.

  • If anything is unclear, use the recorded session for review and reach out to the instructor with questions on points like reverse coding decisions, imputation choices, or how to report exclusions in your thesis.

References to ideas mentioned (conceptual anchors)

  • Path analysis and moderated/mediated regression are highlighted as central topics in early weeks; this aligns with modern quantitative approaches in psychology and social sciences.

  • The emphasis on SPSS with PROCESS macro reflects common practice for path analysis; Jamovi is discussed as an alternative but not the focus of this course.

  • The distinction between data cleaning for a classroom exercise and real-world data collection emphasizes ethical reporting and transparent data handling in academic work.