Statistics Exam Review: Variables, Graphs, Sampling, and Regression

Differentiating Variable Types and Graphs

  • Quantitative Variables

    • Definition: Variables with numerical values where mathematical operations (e.g., addition, averaging) are meaningful.

    • Example: Body Mass Index (BMI) if measured as a continuous number.

  • Categorical Variables

    • Definition: Variables that place individuals into groups or categories.

    • Examples: Religion, Preferred Shopping Method.

    • Ordinal Categorical Variable: A categorical variable where categories have a natural, ordered progression.

      • Example: Body Mass Index (BMI) grouped into categories like 'underweight', 'normal weight', 'overweight', 'obese' – this implies a low to high order.

Graphing Variable Types

  • Bar Graphs

    • Used for: Categorical variables.

    • Axis: The horizontal axis typically displays words or categories.

  • Histograms

    • Used for: Quantitative variables.

    • Axis: The horizontal axis displays a number line.

  • Contingency Tables (Two-Way Tables)

    • Used for: Summarizing the relationship between two categorical variables.

    • Example: Customer age group vs. preferred shopping method (both categorical).

  • Box Plots

    • Used for: Displaying the distribution of quantitative variables.

    • Side-by-side box plots are used to compare a quantitative variable across different groups defined by a categorical variable.

  • Center and Spread Measures

    • Used for: Summarizing quantitative variables.

  • Correlation

    • Used for: Measuring the linear relationship between two quantitative variables.
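To see how a contingency table summarizes two categorical variables, here is a minimal Python sketch; the age groups, shopping methods, and counts are all invented for illustration:

```python
from collections import Counter

# Hypothetical paired observations: (age group, preferred shopping method)
observations = [
    ("18-34", "online"), ("18-34", "online"), ("18-34", "in-store"),
    ("35-54", "in-store"), ("35-54", "online"), ("55+", "in-store"),
]

# Each cell of the two-way table counts one (row, column) combination
table = Counter(observations)

# Marginal totals (row sums) come from counting one variable alone
row_totals = Counter(age for age, _ in observations)

print(table[("18-34", "online")])  # cell count: 2
print(row_totals["18-34"])         # row total: 3
```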

Understanding Skewness and Outliers

  • Skewness

    • Right-Skewed: The 'tail' of the distribution points to the right (higher values). The mean is typically greater than the median (Mean > Median) because the mean is pulled in the direction of the skew/outliers.

    • Left-Skewed: The 'tail' of the distribution points to the left (lower values). On a box plot, this appears as a long whisker toward the low values. The mean is typically less than the median (Mean < Median).

  • Outliers

    • Mild vs. Extreme: Outliers can be identified visually on a box plot (points beyond the whiskers) or calculated using specific formulas related to the Interquartile Range (IQR).

    • Formula for Mild Outliers: values beyond Q1 - 1.5 × IQR or Q3 + 1.5 × IQR (but within the extreme fences)

    • Formula for Extreme Upper Outlier: Q3 + 3 × IQR

    • Formula for Extreme Lower Outlier: Q1 - 3 × IQR

    • Interquartile Range (IQR): IQR = Q3 - Q1 (where Q3 is the third quartile and Q1 is the first quartile)
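The skewness rule and the IQR fences above can be checked in Python. The dataset below is invented, and `statistics.quantiles` with `method="inclusive"` is just one of several quartile conventions, so textbook quartiles may differ slightly:

```python
import statistics

# Hypothetical right-skewed sample: one large value pulls the mean up
data = [2, 3, 3, 4, 5, 5, 6, 7, 8, 30]

mean, median = statistics.mean(data), statistics.median(data)
print(mean > median)  # True: the mean is dragged toward the right tail

# Quartiles; "inclusive" includes the median when splitting the halves
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Extreme fences use 3 * IQR (mild fences would use 1.5 * IQR)
extreme_upper = q3 + 3 * iqr
extreme_lower = q1 - 3 * iqr
print([x for x in data if x > extreme_upper or x < extreme_lower])
```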

Sampling and Bias

  • Population: The entire group that a researcher is interested in (e.g., all Dallas residents).

  • Sample Frame: The list or source from which a sample is drawn (e.g., students living in a specific residence hall).

  • Sample: The subset of the population from whom data is actually collected.

  • Data/Values: The specific responses or measurements obtained from the sample.

  • Statistic vs. Parameter

    • Statistic: A numerical summary that describes a sample (e.g., 60% of sampled individuals).

    • Parameter: A numerical summary that describes an entire population (e.g., 53% of all individuals in the population).

  • Types of Bias

    • Undercoverage Bias: Occurs when the sample frame does not adequately represent the entire population of interest (e.g., sampling residence hall students to learn about all Dallas residents).

    • Nonresponse Bias: Occurs when a significant portion of a selected sample does not respond to a survey or drops out of a study, and these non-respondents differ systematically from those who do respond.

  • Sampling Methods

    • Convenience Sample: A sample consisting of individuals who are readily available or easy to reach.

    • Stratified Sampling: Dividing the population into distinct subgroups (strata) based on some shared characteristic, and then drawing separate samples from each stratum (e.g., sampling 100 men and 100 women separately).
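A stratified draw like the "sample men and women separately" example can be sketched as follows; the population records, stratum labels, and sizes (50 per stratum rather than 100) are hypothetical:

```python
import random

# Hypothetical population records: (name, stratum)
population = [(f"person{i}", "men" if i % 2 == 0 else "women")
              for i in range(200)]

def stratified_sample(records, stratum_sizes, seed=0):
    """Draw a separate simple random sample from each stratum."""
    rng = random.Random(seed)
    sample = []
    for stratum, size in stratum_sizes.items():
        members = [r for r in records if r[1] == stratum]
        sample.extend(rng.sample(members, size))
    return sample

# Sample 50 men and 50 women separately
sample = stratified_sample(population, {"men": 50, "women": 50})
print(len(sample))  # 100
```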

Lurking Variables

  • Definition: A variable that is not included in the study but may be influencing the relationship between the explanatory and response variables, making it seem as though there is a direct relationship between them when there isn't, or obscuring a real relationship.

  • A variable cannot be a lurking variable if it was measured and included in the study.

    • Example: If 'employment status' was measured, it cannot be a lurking variable.

Explanatory and Response Variables

  • Explanatory Variable (Independent Variable): The variable that is thought to influence or explain changes in another variable.

    • Example: Greek life affiliation (categorical: yes/no).

  • Response Variable (Dependent Variable): The variable that measures the outcome of a study; it is thought to be affected by the explanatory variable.

    • Example: GPA (quantitative).

Regression Analysis

  • Regression Equation: Typically of the form y-hat = Intercept + Slope × X.

    • Values are extracted from the regression output table (e.g., 'Intercept' row for y-intercept, 'X variable' coefficient for slope).

  • Y-intercept Interpretation

    • General: The predicted value of the response variable (y-hat) when the explanatory variable (X) is equal to zero.

    • Caution: Only interpret the y-intercept if X=0 is meaningful and within the range of observed data. If X=0 is not a plausible or meaningful value (e.g., a car weighing 0 pounds), then the intercept should not be interpreted in context.

  • Slope Interpretation

    • The predicted change in the response variable (y-hat) for every one-unit increase in the explanatory variable (X).

  • Residual

    • Definition: The difference between the observed value (Y) and the predicted value (y-hat) from the regression line.

    • Formula: Residual = Y (observed) - y-hat (predicted)

    • Steps for calculation:

      1. Determine the regression equation from the output: y-hat = -147.778 + 0.118 × Weight.

      2. Check for extrapolation: ensure the X-value for prediction is within the range of the observed data. If it falls outside that range, the prediction is unreliable (e.g., predicting for a 2,000-pound car is fine only if 2,000 is within the observed weight range).

      3. Plug the X-value (e.g., Weight = 2,000) into the equation to find y-hat.

      4. Subtract y-hat from the actual observed Y-value (e.g., Displacement = 90).

      5. Example: For Weight = 2,000, y-hat = -147.778 + 0.118 × 2000 = 88.222. If observed Y = 90, then Residual = 90 - 88.222 = 1.778.

  • Coefficient of Determination (R^2)

    • Definition: The proportion (or percentage) of the variation in the response variable (Y) that can be explained by the explanatory variable (X) in the linear regression model.

    • Formula: R^2; for simple linear regression it equals the square of the correlation coefficient r.

    • Interpretation: If R^2 = 0.677, then 67.7% of the variation in [response variable] is explained by [explanatory variable].

    • Distinction: Do not confuse with correlation (R), which measures direction and strength but is not explicitly a percentage of variation explained.

Assumptions for Linear Regression

Mismatched patterns in residual plots indicate assumption violations:

  • Linearity: If the relationship between X and Y is not linear, residuals will show a curved pattern (e.g., a 'U' shape) rather than a random scatter around zero.

  • Equal Variance (Homoscedasticity): If the variability of the residuals changes across the range of X values, residuals will show a fan or cone shape (wider at one end than the other).

  • Normality of Residuals: While not directly visible in a simple residual plot, a strong vertical imbalance or skewness in the plot might suggest non-normality. (Often checked with a normal probability plot or Q-Q plot of residuals).

  • Independence of Observations: Not discernible from a residual plot; typically assessed based on the study design and data collection methods.

Types of Studies and Experimental Design

  • Observational Study

    • Method: Researchers observe individuals and measure variables of interest without intervening or imposing treatments.

    • Key characteristic: Involves random selection of individuals to be observed.

    • Conclusion: Can establish association, but not causation.

  • Experimental Study (Experiment)

    • Method: Researchers intentionally intervene by imposing some treatment on individuals and then observe their responses.

    • Key characteristic: Involves random assignment of individuals to different treatment groups.

    • Conclusion: Can establish causation, given a well-designed experiment.

    • Prospective Study: Follows subjects forward in time from a starting point.

  • Elements of Experimental Design

    • Factors: The explanatory variables being manipulated by the researcher (e.g., salesperson demeanor, brand status).

    • Levels: The specific values or categories within each factor (e.g., two levels for demeanor, two for brand status).

    • Treatments: The specific combination of levels of the factors (e.g., 2 demeanor levels * 2 brand status levels = 4 treatments).

    • Experimental Units: The individuals or subjects to whom the treatments are applied (e.g., migraine sufferers).

    • Response Variable: The outcome measured after applying the treatments (e.g., number of migraines).

  • Random Assignment

    • Completely Randomized Experiment: Experimental units are assigned to treatment groups entirely by chance.

    • Block Design: Experimental units are first sorted into homogeneous groups (blocks) based on some pre-existing characteristic, and then random assignment to treatments occurs within each block.
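The design ideas above can be sketched in Python: building treatments as combinations of factor levels, then assigning units completely at random and within blocks. Subject names, block labels, and group sizes are all hypothetical:

```python
import itertools
import random

# Factors and levels (from the example: 2 x 2 = 4 treatments)
factors = {
    "demeanor": ["friendly", "neutral"],
    "brand_status": ["high", "low"],
}
treatments = list(itertools.product(*factors.values()))
print(len(treatments))  # 4

# Completely randomized design: shuffle units, deal them out to treatments
units = [f"subject{i}" for i in range(20)]
rng = random.Random(0)
rng.shuffle(units)
assignment = {t: units[i::len(treatments)] for i, t in enumerate(treatments)}

# Block design: group units into blocks first (by a shared characteristic),
# then randomize to treatments separately within each block
blocks = {"blockA": units[:10], "blockB": units[10:]}
block_assignment = {}
for name, members in blocks.items():
    rng.shuffle(members)
    block_assignment[name] = {
        t: members[i::len(treatments)] for i, t in enumerate(treatments)
    }
```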

  • Control over Bias

    • Placebo: A dummy treatment given to a control group to account for psychological effects (the placebo effect).

    • Blinding: Concealing treatment assignments from participants and/or researchers to prevent bias.

      • Single-blind: Participants do not know which treatment they are receiving.

      • Double-blind: Neither the participants nor the researchers directly interacting with them (and recording data) know who received which treatment. This is crucial when a placebo is used.

Causation and Reliability

  • Causation: Can only be inferred from a well-designed experiment that includes random assignment to treatment groups.

  • Reliability: Larger sample sizes generally lead to more statistically reliable and robust conclusions.