Statistics Exam Review: Variables, Graphs, Sampling, and Regression

Differentiating Variable Types and Graphs

  • Quantitative Variables

    • Definition: Variables with numerical values where mathematical operations (e.g., addition, averaging) are meaningful.

    • Example: Body Mass Index (BMI) if measured as a continuous number.

  • Categorical Variables

    • Definition: Variables that place individuals into groups or categories.

    • Examples: Religion, Preferred Shopping Method.

    • Ordinal Categorical Variable: A categorical variable where categories have a natural, ordered progression.

      • Example: Body Mass Index (BMI) grouped into categories like 'underweight', 'normal weight', 'overweight', 'obese' – this implies a low to high order.

Graphing Variable Types

  • Bar Graphs

    • Used for: Categorical variables.

    • Axis: The horizontal axis typically displays words or categories.

  • Histograms

    • Used for: Quantitative variables.

    • Axis: The horizontal axis displays a number line.

  • Contingency Tables (Two-Way Tables)

    • Used for: Summarizing the relationship between two categorical variables.

    • Example: Customer age group vs. preferred shopping method (both categorical).

  • Box Plots

    • Used for: Displaying the distribution of quantitative variables.

    • Side-by-side box plots are used to compare a quantitative variable across different groups defined by a categorical variable.

  • Center and Spread Measures

    • Used for: Summarizing quantitative variables.

  • Correlation

    • Used for: Measuring the linear relationship between two quantitative variables.
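To see how a contingency table summarizes two categorical variables, here is a minimal Python sketch; the age groups, shopping methods, and counts are all invented for illustration:

```python
from collections import Counter

# Hypothetical paired observations: (age group, preferred shopping method)
observations = [
    ("18-34", "online"), ("18-34", "online"), ("18-34", "in-store"),
    ("35-54", "in-store"), ("35-54", "online"), ("55+", "in-store"),
]

# Each cell of the two-way table counts one (row, column) combination
table = Counter(observations)

# Marginal totals (row sums) come from counting one variable alone
row_totals = Counter(age for age, _ in observations)

print(table[("18-34", "online")])  # cell count: 2
print(row_totals["18-34"])         # row total: 3
```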

Understanding Skewness and Outliers

  • Skewness

    • Right-Skewed: The 'tail' of the distribution points to the right (higher values). The mean is typically greater than the median (Mean > Median) because the mean is pulled in the direction of the skew/outliers.

    • Left-Skewed: The 'tail' of the distribution points to the left (lower values). On a box plot, this appears as a long whisker toward the low values. The mean is typically less than the median (Mean < Median).

  • Outliers

    • Mild vs. Extreme: Outliers can be identified visually on a box plot (points beyond the whiskers) or calculated using specific formulas related to the Interquartile Range (IQR).

    • Formula for Mild Outliers: values beyond Q1 - 1.5 × IQR or Q3 + 1.5 × IQR (but within the extreme fences)

    • Formula for Extreme Upper Outlier: Q3 + 3 × IQR

    • Formula for Extreme Lower Outlier: Q1 - 3 × IQR

    • Interquartile Range (IQR): IQR = Q3 - Q1 (where Q3 is the third quartile and Q1 is the first quartile)
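The skewness rule and the IQR fences above can be checked in Python. The dataset below is invented, and `statistics.quantiles` with `method="inclusive"` is just one of several quartile conventions, so textbook quartiles may differ slightly:

```python
import statistics

# Hypothetical right-skewed sample: one large value pulls the mean up
data = [2, 3, 3, 4, 5, 5, 6, 7, 8, 30]

mean, median = statistics.mean(data), statistics.median(data)
print(mean > median)  # True: the mean is dragged toward the right tail

# Quartiles; "inclusive" includes the median when splitting the halves
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Extreme fences use 3 * IQR (mild fences would use 1.5 * IQR)
extreme_upper = q3 + 3 * iqr
extreme_lower = q1 - 3 * iqr
print([x for x in data if x > extreme_upper or x < extreme_lower])
```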

Sampling and Bias

  • Population: The entire group that a researcher is interested in (e.g., all Dallas residents).

  • Sample Frame: The list or source from which a sample is drawn (e.g., students living in a specific residence hall).

  • Sample: The subset of the population from whom data is actually collected.

  • Data/Values: The specific responses or measurements obtained from the sample.

  • Statistic vs. Parameter

    • Statistic: A numerical summary that describes a sample (e.g., 60% of sampled individuals).

    • Parameter: A numerical summary that describes an entire population (e.g., 53% of all individuals in the population).

  • Types of Bias

    • Undercoverage Bias: Occurs when the sample frame does not adequately represent the entire population of interest (e.g., sampling residence hall students to learn about all Dallas residents).

    • Nonresponse Bias: Occurs when a significant portion of a selected sample does not respond to a survey or drops out of a study, and these non-respondents differ systematically from those who do respond.

  • Sampling Methods

    • Convenience Sample: A sample consisting of individuals who are readily available or easy to reach.

    • Stratified Sampling: Dividing the population into distinct subgroups (strata) based on some shared characteristic, and then drawing separate samples from each stratum (e.g., sampling 100 men and 100 women separately).
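A stratified draw like the "sample men and women separately" example can be sketched as follows; the population records, stratum labels, and sizes (50 per stratum rather than 100) are hypothetical:

```python
import random

# Hypothetical population records: (name, stratum)
population = [(f"person{i}", "men" if i % 2 == 0 else "women")
              for i in range(200)]

def stratified_sample(records, stratum_sizes, seed=0):
    """Draw a separate simple random sample from each stratum."""
    rng = random.Random(seed)
    sample = []
    for stratum, size in stratum_sizes.items():
        members = [r for r in records if r[1] == stratum]
        sample.extend(rng.sample(members, size))
    return sample

# Sample 50 men and 50 women separately
sample = stratified_sample(population, {"men": 50, "women": 50})
print(len(sample))  # 100
```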

Lurking Variables

  • Definition: A variable that is not included in the study but may be influencing the relationship between the explanatory and response variables, making it seem as though there is a direct relationship between them when there isn't, or obscuring a real relationship.

  • A variable cannot be a lurking variable if it was measured and included in the study.

    • Example: If 'employment status' was measured, it cannot be a lurking variable.

Explanatory and Response Variables

  • Explanatory Variable (Independent Variable): The variable that is thought to influence or explain changes in another variable.

    • Example: Greek life affiliation (categorical: yes/no).

  • Response Variable (Dependent Variable): The variable that measures the outcome of a study; it is thought to be affected by the explanatory variable.

    • Example: GPA (quantitative).

Regression Analysis

  • Regression Equation: Typically of the form y-hat = Intercept + Slope × X.

    • Values are extracted from the regression output table (e.g., 'Intercept' row for y-intercept, 'X variable' coefficient for slope).

  • Y-intercept Interpretation

    • General: The predicted value of the response variable (y-hat) when the explanatory variable (X) is equal to zero.

    • Caution: Only interpret the y-intercept if X=0 is meaningful and within the range of observed data. If X=0 is not a plausible or meaningful value (e.g., a car weighing 0 pounds), then the intercept should not be interpreted in context.

  • Slope Interpretation

    • The predicted change in the response variable (y-hat) for every one-unit increase in the explanatory variable (X).

  • Residual

    • Definition: The difference between the observed value (Y) and the predicted value (y-hat) from the regression line.

    • Formula: Residual = Y (observed) - y-hat (predicted)

    • Steps for calculation:

      1. Determine the regression equation from the output: y-hat = -147.778 + 0.118 × Weight.

      2. Check for extrapolation: ensure the X-value for prediction is within the range of the observed data. If it falls outside that range, the prediction is unreliable (e.g., predicting for a 2,000-pound car is fine only if 2,000 is within the observed weight range).

      3. Plug the X-value (e.g., Weight = 2,000) into the equation to find y-hat.

      4. Subtract y-hat from the actual observed Y-value (e.g., Displacement = 90).

      5. Example: For Weight = 2,000, y-hat = -147.778 + 0.118 × 2000 = 88.222. If observed Y = 90, then Residual = 90 - 88.222 = 1.778.

  • Coefficient of Determination (R^2)

    • Definition: The proportion (or percentage) of the variation in the response variable (Y) that can be explained by the explanatory variable (X) in the linear regression model.

    • Formula: R^2; for simple linear regression it equals the square of the correlation coefficient r.

    • Interpretation: If R^2 = 0.677, then 67.7% of the variation in [response variable] is explained by [explanatory variable].

    • Distinction: Do not confuse with correlation (R), which measures direction and strength but is not explicitly a percentage of variation explained.

Assumptions for Linear Regression

Mismatched patterns in residual plots indicate assumption violations:

  • Linearity: If the relationship between X and Y is not linear, residuals will show a curved pattern (e.g., a 'U' shape) rather than a random scatter around zero.

  • Equal Variance (Homoscedasticity): If the variability of the residuals changes across the range of X values, residuals will show a fan or cone shape (wider at one end than the other).

  • Normality of Residuals: While not directly visible in a simple residual plot, a strong vertical imbalance or skewness in the plot might suggest non-normality. (Often checked with a normal probability plot or Q-Q plot of residuals).

  • Independence of Observations: Not discernible from a residual plot; typically assessed based on the study design and data collection methods.

Types of Studies and Experimental Design

  • Observational Study

    • Method: Researchers observe individuals and measure variables of interest without intervening or imposing treatments.

    • Key characteristic: Involves random selection of individuals to be observed.

    • Conclusion: Can establish association, but not causation.

  • Experimental Study (Experiment)

    • Method: Researchers intentionally intervene by imposing some treatment on individuals and then observe their responses.

    • Key characteristic: Involves random assignment of individuals to different treatment groups.

    • Conclusion: Can establish causation, given a well-designed experiment.

    • Prospective Study: Follows subjects forward in time from a starting point.

  • Elements of Experimental Design

    • Factors: The explanatory variables being manipulated by the researcher (e.g., salesperson demeanor, brand status).

    • Levels: The specific values or categories within each factor (e.g., two levels for demeanor, two for brand status).

    • Treatments: The specific combination of levels of the factors (e.g., 2 demeanor levels * 2 brand status levels = 4 treatments).

    • Experimental Units: The individuals or subjects to whom the treatments are applied (e.g., migraine sufferers).

    • Response Variable: The outcome measured after applying the treatments (e.g., number of migraines).

  • Random Assignment

    • Completely Randomized Experiment: Experimental units are assigned to treatment groups entirely by chance.

    • Block Design: Experimental units are first sorted into homogeneous groups (blocks) based on some pre-existing characteristic, and then random assignment to treatments occurs within each block.
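The design ideas above can be sketched in Python: building treatments as combinations of factor levels, then assigning units completely at random and within blocks. Subject names, block labels, and group sizes are all hypothetical:

```python
import itertools
import random

# Factors and levels (from the example: 2 x 2 = 4 treatments)
factors = {
    "demeanor": ["friendly", "neutral"],
    "brand_status": ["high", "low"],
}
treatments = list(itertools.product(*factors.values()))
print(len(treatments))  # 4

# Completely randomized design: shuffle units, deal them out to treatments
units = [f"subject{i}" for i in range(20)]
rng = random.Random(0)
rng.shuffle(units)
assignment = {t: units[i::len(treatments)] for i, t in enumerate(treatments)}

# Block design: group units into blocks first (by a shared characteristic),
# then randomize to treatments separately within each block
blocks = {"blockA": units[:10], "blockB": units[10:]}
block_assignment = {}
for name, members in blocks.items():
    rng.shuffle(members)
    block_assignment[name] = {
        t: members[i::len(treatments)] for i, t in enumerate(treatments)
    }
```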

  • Control over Bias

    • Placebo: A dummy treatment given to a control group to account for psychological effects (the placebo effect).

    • Blinding: Concealing treatment assignments from participants and/or researchers to prevent bias.

      • Single-blind: Participants do not know which treatment they are receiving.

      • Double-blind: Neither the participants nor the researchers directly interacting with them (and recording data) know who received which treatment. This is crucial when a placebo is used.

Causation and Reliability

  • Causation: Can only be inferred from a well-designed experiment that includes random assignment to treatment groups.

  • Reliability: Larger sample sizes generally lead to more statistically reliable and robust conclusions.