Multivariate Data Analysis: Introduction & Data Exploration
Why Data Analysis?
Data analysis is prevalent in the media.
It is crucial for understanding and critically evaluating academic literature.
Essential for collecting and analyzing data to make inferences and explain phenomena.
Statistics is a tool, not an end. Theories/hypotheses are tested quantitatively to draw conclusions.
Different researchers may draw different conclusions from the same data.
Example: Pain intensity measurement (worst, least, average, last 24 hours pain) to assess treatment effectiveness via reduction in pain intensity.
Important considerations: How data is obtained, its meaning, therapy effectiveness, measurement errors, and retest reliability.
Observed pain reduction during treatment doesn't guarantee effectiveness due to natural pain fluctuations over time.
Data Analysis
Necessary for psychologists to:
Organize data (graphs, etc.).
Describe data using descriptive/deductive statistics (summarization).
Interpret data and make statements using inferential/inductive statistics (explanation).
Verify and adapt theories.
Inductive Statistics
Making statements about a population based on a sample.
Statistics (e.g., the sample mean) are computed from the sample and used to estimate the corresponding population parameters (e.g., the population mean μ).
Example: Fatal familial insomnia – studying the entire affected population allows for direct statements about the population.
Often, samples must be drawn because it is rare to access the entire population.
Key Concepts
Theory → Hypothesis → Sample → Sample Statistics
Sample Statistic: A measure based on sample data (e.g., arithmetic mean, proportion).
A sample statistic is a random variable with a specific distribution, called the sampling distribution.
Random sample 1 yields S1.
Random sample 2 (same n) yields S2.
And so on, up to Sk (k denotes the number of samples, not the sample size n).
S1, S2, S3, S4, …, Sk
The distribution of these sample statistics = sampling distribution
Sample Distribution: Empirical frequency distribution of sample outcomes.
Sampling Distribution: Theoretical probability distribution of all possible values a sample statistic can take across all possible samples.
If the sample statistic is the sample mean (x̄), then repeated random samples of size n from a normally distributed population with mean μ and standard deviation σ result in a sampling distribution of the sample mean that is normally distributed: N(μ, σ/√n)
Central Limit Theorem
When repeated random samples of size n are drawn from any arbitrarily distributed population with mean μ and standard deviation σ, then if n is sufficiently large (rule of thumb: n ≥ 30), the sampling distribution of the sample mean approximates a normal distribution: N(μ, σ/√n)
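The Central Limit Theorem can be checked with a small simulation. The sketch below is illustrative only: the exponential population (μ = 1, σ = 1), the number of samples, and all function names are assumptions chosen for the example, not part of the course material.

```python
import random
import statistics

def simulate_sample_means(draw, n, num_samples, seed=0):
    """Draw `num_samples` random samples of size n and return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(draw(rng) for _ in range(n))
        for _ in range(num_samples)
    ]

def draw_exponential(rng):
    """One value from a skewed (non-normal) population with mu = 1, sigma = 1."""
    return rng.expovariate(1.0)

n = 30  # rule-of-thumb sample size from the notes
means = simulate_sample_means(draw_exponential, n=n, num_samples=5000)

mean_of_means = statistics.fmean(means)   # should be close to mu
sd_of_means = statistics.stdev(means)     # should be close to sigma / sqrt(n)
expected_sd = 1.0 / n ** 0.5

print(f"mean of sample means: {mean_of_means:.3f} (mu = 1)")
print(f"sd of sample means:   {sd_of_means:.3f} (sigma/sqrt(n) = {expected_sd:.3f})")
```

Although the population is clearly skewed, the sample means cluster around μ with a spread close to σ/√n.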
Data Exploration Overview
Graphical data exploration.
Analysis of Missing Data.
Outliers.
Assumptions:
Normality
Homoscedasticity
Linearity
Data Transformations.
Dummy Coding.
Conclusion.
Eyeballing Data
Common among those unfamiliar with statistics.
Involves using charts like pie diagrams that may not be representative or tested.
Important for identifying errors in the data.
Graphs should include key parameters like spread, mean, error bars, and confidence intervals.
Graphical Exploration of Data
Distribution analysis using:
Histograms
Stem-and-leaf diagrams
Box plots
Provides a global overview.
Boxplot Interpretation
Provides information about position, spread, and symmetry.
Skewness:
Right-skewed (positive)
Left-skewed (negative)
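Uniformly distributed data result in a larger box relative to the whiskers, because the values are spread evenly rather than concentrated around the median.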
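The quantities a box plot displays can be computed directly. The sketch below is a rough illustration: the scores are made up, the function names are hypothetical, and judging skewness by comparing the two halves of the box is a simple heuristic, not a formal test.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

def skew_direction(values):
    """Heuristic: compare the upper and lower half of the box."""
    _, q1, median, q3, _ = five_number_summary(values)
    upper, lower = q3 - median, median - q1
    if upper > lower:
        return "right-skewed (positive)"
    if upper < lower:
        return "left-skewed (negative)"
    return "symmetric"

# Hypothetical scores with a long tail towards high values.
scores = [1, 2, 2, 3, 3, 3, 4, 5, 8, 14]

print(five_number_summary(scores))
print(skew_direction(scores))
```

Position is read from the median, spread from the distance between Q1 and Q3 (the box), and symmetry from how the median sits inside the box.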
Histogram Interpretation
Provides information about the normality of the distribution.
Graph Aesthetics
Images are typically viewed before results.
Data visualization should be clear, accurate, and appealing.
Good visualizations can reduce the amount of necessary text in research.
Analysis of Missing Data
Causes of Missing Data
Independent of Respondent
Procedure-related (e.g., branching questionnaires where some questions are skipped based on prior responses).
Coding errors during data entry.
Dependent on Respondent
Reluctance to answer (e.g., income questions, especially in the private sector).
Scope of Missing Data
Determining the quantity of missing data (few vs. many missing values) is important for assessing the impact on analysis.
Analyze the missing data profile to identify patterns (systematic or random).
Impact of Missing Data
Practical Impact
Reduction in sample size (listwise deletion).
If excessive, consider increasing N or addressing the missingness.
Nonrandom Missingness
Can introduce bias.
May exclude specific groups from the analysis (e.g., high-income individuals).
This bias is only detectable if missing data is studied.
Types of Missing Data
Negligible Missing Data
Expected and often part of the procedure.
Includes:
Data from individuals not in the sample.
Skip patterns.
Censored data (e.g., participant death during data collection).
Does not require remediation but may require design adjustments.
Non-Negligible Missing Data
Known non-negligible MD (related to procedural factors):
Coding errors.
Incomplete questionnaires due to time constraints.
Respondent death.
Unknown non-negligible MD (related to the respondent):
Refusal to answer sensitive questions or 'no opinion'.
Difficult to control or remediate.
Determining the Severity of Missing Data
If the extent is very small:
<10% per case: acceptable.
Sufficient cases without missing data.
No non-randomness: any remedy is acceptable.
If the extent is large:
Investigate randomness.
Conduct sensitivity analyses using smaller samples.
If methods converge, data is likely correct.
If they diverge, determine the most accurate method.
Investigating Randomness in Missing Data
Missing Completely at Random (MCAR)
Missing data is randomly distributed across subgroups.
The probability of missing data is the same for everyone in the sample.
The cause of missing data is independent of the data itself.
Any remedy is acceptable, but imputing the mean reduces the variance of the data, because every imputed value sits exactly at the mean.
Very rare.
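The effect of mean imputation on variance can be shown with simulated data. This is a minimal sketch under assumed values: the normal population (mean 50, SD 10), the 30% missingness rate, and the variable names are all hypothetical.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical complete variable: 200 scores from a normal distribution.
complete = [rng.gauss(50, 10) for _ in range(200)]

# Impose MCAR: every value has the same 30% chance of going missing (None).
observed = [x if rng.random() > 0.30 else None for x in complete]

valid = [x for x in observed if x is not None]
mean_valid = statistics.fmean(valid)

# Mean imputation: replace each missing value with the mean of the valid ones.
imputed = [mean_valid if x is None else x for x in observed]

sd_valid = statistics.stdev(valid)
sd_imputed = statistics.stdev(imputed)

print(f"valid cases: {len(valid)} of {len(complete)}")
print(f"mean before/after imputation: {mean_valid:.2f} / {statistics.fmean(imputed):.2f}")
print(f"SD before/after imputation:   {sd_valid:.2f} / {sd_imputed:.2f}")
```

The mean is unchanged, but the standard deviation shrinks, which is why the notes warn about considering data spread when imputing.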
Missing at Random (MAR)
Within subgroups, missing data is random but differs between groups.
Missing data depends on other variables.
Example: Income data missing in a study predicting income based on education.
If missing among the least educated → MAR.
If missing among the highest incomes → MNAR (Missing Not At Random).
MAR Definition (per ChatGPT)
"Missing at random" means that missing values depend on other values in the dataset, but are independent of the missing values themselves.
Reason for missingness is related to observed data, not the unobserved missing data.
Example: If the probability that education level is missing is related to age and income, it's MAR because missingness depends on age and income (both observed), not on the missing education level itself.
Missing Data Mechanisms Explained
Missing Completely at Random (MCAR): The reason for missing data is unrelated to observed or unobserved data. The probability of missing data is the same for all units, independent of both observed and missing data. E.g., a survey randomly stopped due to a power outage.
Missing at Random (MAR): The reason for missing data is related to observed data but not the missing data itself. Given the observed data, the probability of missing data is the same for all units. E.g., men are less likely to answer an emotion question, dependent on gender (observed) and not the actual emotions (missing).
Missing NOT at Random (MNAR): The reason for missing data is related to the unobserved (missing) data itself. The probability of missing data depends on the unobserved values, possibly in addition to observed data. E.g., low-income people are less likely to report their income; the reason for missingness depends on the actual income (unobserved).
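The three mechanisms can be compared in a simulation based on the income and education example. Everything numeric here is an assumption made for illustration: the population model, the missingness probabilities, and the helper names are hypothetical.

```python
import random
import statistics

rng = random.Random(1)
N = 20000

# Hypothetical population: education (0 = low, 1 = high) is always observed;
# income depends on education plus random noise.
education = [rng.randint(0, 1) for _ in range(N)]
income = [30 + 20 * e + rng.gauss(0, 10) for e in education]

def apply_missingness(prob_missing):
    """Return income with None wherever a value goes missing.
    prob_missing(e, y) is the probability that income y is missing."""
    return [None if rng.random() < prob_missing(e, y) else y
            for e, y in zip(education, income)]

mcar = apply_missingness(lambda e, y: 0.3)                     # same for everyone
mar = apply_missingness(lambda e, y: 0.5 if e == 0 else 0.1)   # depends on observed education
mnar = apply_missingness(lambda e, y: 0.5 if y > 50 else 0.1)  # depends on income itself

def mean_income(values, group=None):
    """Mean of the non-missing values, optionally within one education group."""
    return statistics.fmean(
        y for e, y in zip(education, values)
        if y is not None and (group is None or e == group)
    )

print(f"{'data':<10}{'overall':>10}{'low edu':>10}{'high edu':>10}")
for label, values in [("complete", income), ("MCAR", mcar),
                      ("MAR", mar), ("MNAR", mnar)]:
    print(f"{label:<10}{mean_income(values):>10.2f}"
          f"{mean_income(values, 0):>10.2f}{mean_income(values, 1):>10.2f}")
```

Under MCAR the observed means stay close to the complete-data means. Under MAR the overall mean shifts, but the means within each education group remain accurate, so the observed variable can be used to correct for the missingness. Under MNAR even the within-group means are biased.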
How to Assess Missing Data
Visually inspect the data.
Use diagnostic tests:
Compare cases with missing data on variable Y to those without missing data; do they differ on other variables (e.g., t-tests)?
Recode: Valid response = 1; missing = 0; then compute a point-biserial correlation (Pearson correlation) to see if missingness is related to other variables.
Runs Test and Test for serial correlation
Spectral analysis
Chi-square test for randomness
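The first two diagnostics (group comparison and the recoded missingness indicator) can be sketched as follows. The data are simulated and hypothetical, the helper functions are written out by hand for transparency, and only the test statistic is computed; a statistics package would be needed for the p-value.

```python
import math
import random
import statistics

def pearson_r(x, y):
    """Pearson correlation; with a 0/1 variable this is the point-biserial r."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def welch_t(group_a, group_b):
    """t statistic for two independent groups without assuming equal variances."""
    va, vb = statistics.variance(group_a), statistics.variance(group_b)
    se = math.sqrt(va / len(group_a) + vb / len(group_b))
    return (statistics.fmean(group_a) - statistics.fmean(group_b)) / se

rng = random.Random(7)

# Hypothetical data: age is fully observed; income goes missing more often
# for older respondents (so missingness is not completely random).
age = [rng.uniform(20, 70) for _ in range(500)]
income = [None if rng.random() < (a - 20) / 100 else rng.gauss(40, 10)
          for a in age]

# Recode: valid response = 1, missing = 0.
responded = [0 if y is None else 1 for y in income]

r = pearson_r(responded, age)
age_missing = [a for a, m in zip(age, responded) if m == 0]
age_valid = [a for a, m in zip(age, responded) if m == 1]
t = welch_t(age_missing, age_valid)

print(f"point-biserial r (responded vs. age): {r:.2f}")
print(f"Welch t (age: missing vs. valid):     {t:.2f}")
```

A correlation clearly different from zero, or a large t value, suggests that missingness on income is related to age and therefore not MCAR.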
Managing Missing Data
Avoid missing data by checking questionnaires and ensuring careful coding.
Use standard listwise deletion (only complete cases) and be transparent about this decision in reporting.
Delete cases and/or variables if missingness is random.
For MAR or MCAR, use imputation to replace missing data, considering data spread.
Use all available information (pairwise deletion).
Replace missing data with comparable cases, averages, or estimated values via regression.
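The difference between listwise deletion, pairwise deletion, and mean imputation is easiest to see on a small data set. The values and variable names below are made up for illustration, and the three helper functions are hypothetical names, not functions from any package.

```python
import statistics

# Hypothetical data set: three variables, None marks a missing value.
rows = [
    {"age": 23, "income": 31, "stress": 4},
    {"age": 35, "income": None, "stress": 6},
    {"age": 41, "income": 52, "stress": None},
    {"age": 29, "income": 38, "stress": 5},
    {"age": 52, "income": 60, "stress": 7},
    {"age": 47, "income": None, "stress": 6},
]

def listwise(rows):
    """Keep only cases that are complete on every variable."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise(rows, var_a, var_b):
    """Keep every case that is complete on the two variables being analysed."""
    return [(r[var_a], r[var_b]) for r in rows
            if r[var_a] is not None and r[var_b] is not None]

def impute_mean(rows, var):
    """Return a copy in which missing values of `var` are set to its valid mean."""
    mean = statistics.fmean(r[var] for r in rows if r[var] is not None)
    return [{**r, var: mean if r[var] is None else r[var]} for r in rows]

print("listwise n:           ", len(listwise(rows)))
print("pairwise n age-stress:", len(pairwise(rows, "age", "stress")))
print("pairwise n age-income:", len(pairwise(rows, "age", "income")))
print("imputed income values:", [r["income"] for r in impute_mean(rows, "income")])
```

Listwise deletion loses the most cases, pairwise deletion uses a different subset for each pair of variables, and imputation keeps every case at the cost of adding estimated values.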
Missing Data Process
Figure 2.5 outlines a four-step process for identifying missing data and applying appropriate remedies:
Determine the type of missing data (ignorable vs. non-ignorable).
Determine the extent of missing data (substantial enough to warrant action?).
Diagnose the randomness of the missing data processes (MAR or MCAR).
Select the appropriate imputation method or data application method.
Outliers
Observations that are distinctly different from others.
Can significantly impact analysis and interpretation.
Assess outlier representativeness for the population.
Outliers are not always negative, and understanding why they occur can be useful.
Removing outliers can give a better description of the bulk of the data, but results should be examined both with and without the outliers.
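Comparing summaries with and without a suspected outlier shows which statistics are sensitive to it. The reaction times below are invented for illustration, and treating 2400 as the outlier is an assumption of the example.

```python
import statistics

def describe(values):
    """Basic location and spread statistics for a list of numbers."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }

# Hypothetical reaction times in ms; 2400 is the suspected outlier.
reaction_times = [410, 425, 430, 445, 450, 460, 475, 2400]
without_outlier = [x for x in reaction_times if x != 2400]

with_stats = describe(reaction_times)
without_stats = describe(without_outlier)

for key in with_stats:
    print(f"{key:<7} with: {with_stats[key]:>8.1f}   without: {without_stats[key]:>8.1f}")
```

The mean, standard deviation, and maximum change substantially, while the median barely moves.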
Practical Impact
Large difference between mean and median indicates a skewed distribution.
Normal distributions have coinciding mean and median.
Outliers greatly affect spread (min/max values).
When defining outliers, it is best to transform values or replace them with an "