Lesson 3 Study Notes: Describing Data with IQR, Correlation, and Regression

Lesson Overview and Learning Objectives

  • This lesson is the second of two covering the description of data.
  • The lecture utilizes Minitab Express for the majority of graphical constructions and statistical calculations.
  • Hand calculations for this lesson are specifically limited to:
    * Using the Interquartile Range (IQR) method to identify outliers.
    * Computing residuals for linear regression models.
  • There are eight core learning objectives for this week:
    1. Construct and interpret a box plot and side-by-side box plots.
    2. Use the IQR method to identify outliers.
    3. Construct and interpret histograms with groups and dot plots with groups.
    4. Construct and interpret a scatter plot.
    5. Compute and interpret a correlation.
    6. Construct and interpret a simple linear regression model.
    7. Compute and interpret residuals from a simple linear regression model.
    8. Interpret plots of more than two variables.

Objective 1: Box Plots and Side-by-Side Box Plots

  • Box Plot Definition: A box plot (also known as a box-and-whisker plot) is a graphical representation of the five-number summary of a dataset.

  • Components of a Box Plot:
    * Outliers: Represented by stars on either end of the plot. These are points identified as significantly distant from the rest of the data.
    * Minimum: The lowest point on the far left (horizontal) or bottom (vertical) that is not considered an outlier.
    * First Quartile (Q1): The 25th percentile; forms the bottom or left boundary of the box.
    * Median: The line located inside the box, representing the 50th percentile.
    * Third Quartile (Q3): The 75th percentile; forms the top or right boundary of the box.
    * Maximum: The highest point on the far right (horizontal) or top (vertical) that is not considered an outlier.
    * The Box: Represents the middle 50% of observations within the distribution.

  • Orientation:
    * Horizontal Box Plots: Often used to align with numerical summaries.
    * Vertical Box Plots: The default orientation in Minitab Express.

  • Side-by-Side Box Plots:
    * Used when comparing one quantitative variable (e.g., height) across different levels of one categorical grouping variable (e.g., biological sex).
    * Interpretation:
        * Compare the central tendency by looking at the medians.
        * Compare the variability by looking at the height/size of the boxes, which represents the Interquartile Range (IQR).
    * Example Comparison (Height by Biological Sex):
        * If the female box and median are lower on the axis than the male box and median, the central tendency for height is higher for males.
        * If the boxes are approximately the same size, the variability (IQR) is similar for both groups.

Objective 2: The Interquartile Range (IQR) Method for Outliers

  • Outlier Definition: Values within a dataset that fall outside the general scope of other observations.
  • The IQR Method: An objective, mathematical method used by Minitab Express to determine which points to label as outliers in a box plot.
  • The IQR Formula: The range of the middle 50% of observations.
    * $IQR = Q3 - Q1$
  • Setting Up "Fences": Any observation falling outside these calculated limits is considered an outlier.
    * Fence Distance: Each fence is set at $1.5 \times IQR$ away from the quartiles.
    * Lower Fence: $Q1 - 1.5 \times IQR$
    * Upper Fence: $Q3 + 1.5 \times IQR$
  • Interpretation of Fences:
    * Observations less than the lower fence are outliers.
    * Observations greater than the upper fence are outliers.
Hand Calculation Example (Quiz Scores)
  • Data ($n = 16$): 4, 8, 10, 11, 13, 14, 14, 14, 14, 15, 15, 15, 15, 15, 16, 17.
  • Step 1 (Find Median): Between the 8th and 9th observation. Here, both are 14, so Median = 14.
  • Step 2 (Find Q1): The middle of the lower half (8 observations). It falls between 11 and 13.
    * $Q1 = \frac{11 + 13}{2} = 12$
  • Step 3 (Find Q3): The middle of the top half (8 observations). It falls between 15 and 15.
    * $Q3 = 15$
  • Step 4 (Calculate IQR):
    * $IQR = Q3 - Q1 = 15 - 12 = 3$
  • Step 5 (Determine Fence Distance):
    * $1.5 \times IQR = 1.5 \times 3 = 4.5$
  • Step 6 (Calculate Fences):
    * Lower Fence: $12 - 4.5 = 7.5$
    * Upper Fence: $15 + 4.5 = 19.5$
  • Conclusion: The score of 4 is lower than 7.5 and is an outlier. There are no scores above 19.5, so there are no upper-end outliers.
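
The hand calculation above can be checked with a short Python sketch. It assumes the same median-of-halves rule used in the steps (split the sorted data in half and take the median of each half); statistical software may use a different quartile convention, so its Q1 and Q3 can differ slightly. The function name `iqr_fences` is illustrative and not part of any library.

```python
from statistics import median

def iqr_fences(values):
    """Return (q1, q3, iqr, lower_fence, upper_fence) via the median-of-halves rule."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two observations")
    half = len(data) // 2
    lower_half = data[:half]    # when n is odd, the middle value is left out
    upper_half = data[-half:]
    q1 = median(lower_half)
    q3 = median(upper_half)
    iqr = q3 - q1
    return q1, q3, iqr, q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Quiz scores from the worked example (n = 16)
scores = [4, 8, 10, 11, 13, 14, 14, 14, 14, 15, 15, 15, 15, 15, 16, 17]
q1, q3, iqr, lower, upper = iqr_fences(scores)
outliers = [s for s in scores if s < lower or s > upper]
print(q1, q3, iqr, lower, upper, outliers)
# prints: 12.0 15.0 3.0 7.5 19.5 [4]
```

The output reproduces the fences of 7.5 and 19.5 and flags only the score of 4.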

Objective 3: Histograms and Dot Plots with Groups

  • Used for displaying one quantitative variable and one categorical variable.
  • Histograms with Groups: Displays multiple histograms (one for each group). Minitab Express keeps the x-axis scale consistent across groups to facilitate comparison.
  • Dot Plots with Groups: Displays stacks of dots for each group, often vertically stacked.
  • Comparison Advantage: Dot plots are generally preferred for visual comparison because the groups are stacked directly on top of each other, making differences in distribution and central tendency more apparent than the left-to-right alignment in histograms.

Objective 4: Scatter Plots

  • Purpose: To display the relationship between two quantitative variables.
  • Variables:
    * Y-Variable: The response variable (dependent variable).
    * X-Variable: The explanatory variable (independent/predictor variable). It is used to predict or explain variability in the response variable.
Interpreting Scatter Plots (4 Key Factors)
  1. Direction:
    * Positive Relationship: Moves from bottom-left to upper-right (positive slope).
    * Negative Relationship: Moves from upper-left to lower-right (negative slope).
    * No Relationship: A flat, horizontal pattern.
  2. Form (Shape):
    * Linear: The points follow a straight line (the focus of this course).
    * Nonlinear (Curvilinear): Points follow a curved pattern (e.g., the Yerkes-Dodson Law where anxiety increases performance up to a point, then decreases it).
  3. Strength:
    * Determined by how closely the observations cluster around the line of best fit. Closely clustered points indicate a strong relationship; spread-out points indicate a weak relationship.
  4. Bivariate Outliers:
    * An observation that does not fit the general pattern of the rest of the observations for both variables combined.

Objective 5: Correlation (Pearson’s $r$)

  • Required Conditions:
    * Exactly two quantitative variables.
    * A linear relationship between variables.
  • Properties of Pearson’s $r$:
    * The value of $r$ is always between $-1$ and $+1$.
    * Direction: The sign of $r$ ($+$ or $-$) indicates the direction of the relationship.
    * Strength: The closer to $-1$ or $+1$, the stronger the relationship. The closer to $0$, the weaker the relationship. (Example: $-0.8$ is stronger than $+0.6$.)
    * Standardization: Correlation is unit-free. Variables do not need to be on the same scale (e.g., correlating inches and pounds).
    * Interchangeability: The correlation between $x$ and $y$ is identical to the correlation between $y$ and $x$.
Strength Interpretation Table (Absolute Value of $r$)
Range of $|r|$    Strength Interpretation
0.0 - 0.2         Very Weak
0.2 - 0.4         Weak
0.4 - 0.6         Moderately Strong
0.6 - 0.8         Strong
0.8 - 1.0         Very Strong
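
As a quick illustration, the table can be written as a lookup on $|r|$. This is a minimal sketch: the function name `strength_label` is made up for the example, and because the table's ranges share their endpoints, the code assumes a boundary value such as $|r| = 0.4$ belongs to the higher band.

```python
def strength_label(r):
    """Map a correlation r to the strength labels in the table, using |r|."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and +1")
    size = abs(r)  # strength ignores the sign; the sign only gives direction
    # Assumption: a shared endpoint (0.2, 0.4, 0.6, 0.8) goes to the higher band.
    if size < 0.2:
        return "Very Weak"
    if size < 0.4:
        return "Weak"
    if size < 0.6:
        return "Moderately Strong"
    if size < 0.8:
        return "Strong"
    return "Very Strong"

for r in (-0.8, 0.6, 0.1):
    print(r, strength_label(r))
```

Under that boundary assumption, $-0.8$ is labeled "Very Strong" and $+0.6$ is labeled "Strong", matching the point that $-0.8$ is the stronger of the two.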
Cautions and Misconceptions
  • Correlation does not equal causation: A strong association does not prove that one variable causes the other; a causal conclusion requires a well-designed experiment.
  • Linearity: $r$ only measures linear relationships. Scatter plots should be checked before calculating $r$.
  • Influence of Outliers: Pearson’s $r$ is heavily influenced by outliers, which can make the correlation appear stronger or weaker than it actually is.
Mathematical Basis
  • Pearson's $r$ can be expressed in terms of $z$-scores:
    * $r \approx \frac{\sum (z_x \times z_y)}{n - 1}$
    * Each product $z_x \times z_y$ is positive if both $z$-scores are positive (above the mean) or both are negative (below the mean), so observations like these contribute to a positive $r$.
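
The $z$-score form can be sketched in Python using only the standard library. The height and weight values below are hypothetical, invented for illustration, and `pearson_r` is an illustrative name. The sketch assumes the $z$-scores are computed with the sample standard deviation (the $n - 1$ version), in which case the expression gives Pearson's $r$ exactly.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson's r as the sum of z-score products divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1)
    z_x = [(x - x_bar) / s_x for x in xs]
    z_y = [(y - y_bar) / s_y for y in ys]
    return sum(a * b for a, b in zip(z_x, z_y)) / (len(xs) - 1)

# Hypothetical data: heights in inches, weights in pounds
heights_in = [60, 62, 65, 68, 70]
weights_lb = [110, 120, 140, 155, 170]
heights_cm = [h * 2.54 for h in heights_in]

print(pearson_r(heights_in, weights_lb))  # r with height in inches
print(pearson_r(heights_cm, weights_lb))  # same r: correlation is unit-free
print(pearson_r(weights_lb, heights_in))  # same r: x and y are interchangeable
```

The three printed values agree up to floating-point rounding, which illustrates the unit-free and interchangeability properties listed earlier.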

Objective 6: Simple Linear Regression

  • Definition: "Simple" means one explanatory variable ($x$) and one response variable ($y$). "Linear" means a straight-line model.

  • Usage: In regression, the choice of $x$ and $y$ matters; $x$ must be the predictor and $y$ the outcome.

  • Algebra vs. Statistics Notation:
    * Algebra: $y = mx + b$ (where $m$ is slope and $b$ is intercept).
    * Statistics (Sample): $\hat{y} = a + bx$
        * $\hat{y}$: Predicted or estimated response value.
        * $a$: Y-intercept (the predicted value of $y$ when $x = 0$).
        * $b$: Slope (the change in predicted $y$ for a one-unit increase in $x$).
    * Statistics (Population): $\hat{y} = \alpha + \beta x$ or $\hat{y} = \beta_0 + \beta_1 x$

  • Least Squares Method: The statistical method used to find the line that minimizes the Sum of Squared Errors (SSE).
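
The notes leave the fitting to software, but the idea can be sketched with the standard closed-form least squares solution for one predictor: $b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$ and $a = \bar{y} - b\bar{x}$. These formulas are not stated in the lesson and are brought in here as an assumption for illustration; the data and the names `fit_line` and `sse` are hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) for y-hat = a + b*x using closed-form least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("x values must not all be identical")
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

def sse(xs, ys, a, b):
    """Sum of Squared Errors for the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, invented for illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = fit_line(xs, ys)
print(f"y-hat = {a:.2f} + {b:.2f}x, SSE = {sse(xs, ys, a, b):.2f}")
# Nudging the intercept or slope away from the fitted values raises the SSE
print(f"SSE with a + 0.1: {sse(xs, ys, a + 0.1, b):.2f}")
print(f"SSE with b + 0.1: {sse(xs, ys, a, b + 0.1):.2f}")
```

Comparing the three SSE values shows what "minimizes" means in practice: any other line through the same data has a larger sum of squared errors.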

Objective 7: Residuals

  • Residual Definition ($e$): The error or difference between the observed value and the predicted value.
  • Formula: $e = y - \hat{y}$
    * $y$ = observed value.
    * $\hat{y}$ = value predicted by the regression equation.
  • Visual Interpretation: The vertical distance from a data point to the regression line on a scatter plot.
    * Positive Residual: Point is above the regression line.
    * Negative Residual: Point is below the regression line.
  • Least Squares Criterion: Minimizes $\sum e^2$, which is the same as $\sum (y - \hat{y})^2$.
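
A residual calculation can be sketched in a few lines of Python. The regression equation $\hat{y} = 2.2 + 0.6x$ and the observed points below are hypothetical, chosen only to illustrate the formula $e = y - \hat{y}$; `residual` is an illustrative name.

```python
def residual(x, y, a, b):
    """Residual e = y - y-hat, where y-hat = a + b*x."""
    y_hat = a + b * x
    return y - y_hat

# Hypothetical fitted line y-hat = 2.2 + 0.6x and hypothetical observations
A, B = 2.2, 0.6
points = [(1, 2), (2, 4), (3, 5), (4, 4), (5, 5)]

for x, y in points:
    e = residual(x, y, A, B)
    position = "above" if e > 0 else "below" if e < 0 else "on"
    print(f"x={x}, y={y}, residual={e:+.1f} ({position} the line)")

sum_sq = sum(residual(x, y, A, B) ** 2 for x, y in points)
print(f"sum of squared residuals = {sum_sq:.2f}")
```

For the first point, $\hat{y} = 2.2 + 0.6(1) = 2.8$, so $e = 2 - 2.8 = -0.8$: the observation sits below the line.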
Cautions in Regression
  1. Extrapolation: Do not use the regression model to predict values for $x$ that are outside the range of the original data. There is no evidence the relationship remains linear beyond the studied range.
  2. Check for Linearity: Always use a scatter plot to ensure a straight line is an appropriate model.
  3. Outliers: Outliers can drastically shift the slope and intercept of the regression line.
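
One way to make the extrapolation caution concrete is a prediction helper that refuses $x$ values outside the range of the original data. This is only a sketch: the function name `predict_within_range`, the coefficients, and the range of 1 to 5 are all hypothetical.

```python
def predict_within_range(x, a, b, x_min, x_max):
    """Return y-hat = a + b*x, but only for x inside the observed range."""
    if not x_min <= x <= x_max:
        raise ValueError(
            f"x = {x} is outside the observed range [{x_min}, {x_max}]; "
            "predicting here would be extrapolation"
        )
    return a + b * x

# Hypothetical model y-hat = 2.2 + 0.6x, fitted on x values from 1 to 5
print(predict_within_range(3, 2.2, 0.6, 1, 5))   # inside the range: allowed

try:
    predict_within_range(10, 2.2, 0.6, 1, 5)     # outside the range: refused
except ValueError as err:
    print("Refused:", err)
```

Raising an error is a deliberately strict choice for the sketch; a warning would also serve, as long as the prediction is not silently treated as trustworthy.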

Objective 8: Visualizing More Than Two Variables

  1. Scatter Plots with Groups:
    * Displays two quantitative variables ($x$, $y$) and one categorical variable.
    * Uses different markers (colors/shapes) to represent groups (e.g., blue circles for females, red squares for males).
  2. Bubble Plots:
    * Displays three quantitative variables.
    * The $x$ and $y$ axes represent two variables, while the size of the bubble represents the third quantitative variable.
    * Color can be added to represent a fourth categorical variable.
  3. Time Series Plots:
    * Displays changes in a quantitative variable over time.
    * X-axis: Always represents the measure of time (years, months, etc.).
    * Y-axis: Represents the quantitative response variable.
    * Time Series with Groups: Allows comparison of multiple entities across the same time period using different colored lines and markers.