SDS CH3 - Empirical Rule, Chebyshev's Theorem, and Bivariate Data Relationships

The Empirical Rule: Review and Specific Conditions

  • The Empirical Rule gives the approximate percentage of data that lies within a given number of standard deviations of the mean for specific distributions:
      * Within one standard deviation of the mean: about 68% of the data.
      * Within two standard deviations of the mean: about 95% of the data.
      * Within three standard deviations of the mean: about 99.7% of the data.

  • This rule is not universal: it cannot be applied to an arbitrary dataset.

  • Specific Condition: The data must follow a symmetric and bell-shaped distribution.

  • While bell-shaped data arise in many fields, the rule's reach is limited because many distributions are naturally asymmetrical and require different mathematical models.

Practical Application: SAT Scores Example

  • The variables and parameters for the SAT score example are:
      * Variable: $x = \text{SAT scores}$.
      * Mean: $\mu = 500$.
      * Standard Deviation: $\sigma = 90$.
      * Distribution: Bell-shaped (allowing the use of the Empirical Rule).

  • One Standard Deviation Interval:
      * Calculation: $500 \pm 90$
      * Interval: $(410, 590)$
      * Data Content: 68% of test takers score between 410 and 590.

  • Two Standard Deviation Interval:
      * Calculation: $500 \pm 180$
      * Interval: $(320, 680)$
      * Data Content: 95% of test takers score between 320 and 680.

  • Three Standard Deviation Interval:
      * Calculation: $500 \pm 270$
      * Interval: $(230, 770)$
      * Data Content: 99.7% of test takers score between 230 and 770.

  • Interpretation: A significant portion of the data is captured within one standard deviation. The three-standard-deviation range contains nearly all of the data.
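
The three SAT intervals above can be reproduced with a short Python sketch. The function name `empirical_intervals` and the lookup table `EMPIRICAL_COVERAGE` are illustrative names, not part of the notes, and the coverage figures are the approximate values the rule states for bell-shaped data.

```python
# Approximate coverage stated by the Empirical Rule for k = 1, 2, 3.
# Only meaningful when the data are symmetric and bell-shaped.
EMPIRICAL_COVERAGE = {1: 0.68, 2: 0.95, 3: 0.997}


def empirical_intervals(mean, std):
    """Return {k: (lower, upper, coverage)} for k = 1, 2, 3 standard deviations."""
    return {
        k: (mean - k * std, mean + k * std, coverage)
        for k, coverage in EMPIRICAL_COVERAGE.items()
    }


# SAT example from the notes: mean 500, standard deviation 90.
sat = empirical_intervals(mean=500, std=90)
for k, (lower, upper, coverage) in sat.items():
    print(f"k={k}: ({lower}, {upper}) holds about {coverage:.1%} of scores")
```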

Identification of Outliers

  • Data points that fall more than three standard deviations from the mean are unusual enough to be flagged as outliers.

  • Methods of Identification:
      * Z-Scores: Convert each value to $z = \frac{x - \mu}{\sigma}$ and flag values with $|z| > 3$, that is, values more than three standard deviations away from the mean.
      * Original Data Bounds: Calculate the numerical bounds $\mu \pm 3\sigma$ and flag values outside them, which identifies outliers in the original data units rather than in transformed z-units.
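
Both identification methods can be sketched in Python as follows. The score list is hypothetical data invented for illustration, and the mean and standard deviation are taken as known from the SAT example (500 and 90) rather than estimated from the list. The helper names are also illustrative.

```python
def z_scores(values, mean, std):
    """Convert raw values to z-scores: z = (x - mean) / std."""
    return [(x - mean) / std for x in values]


def outlier_bounds(mean, std, k=3):
    """Return the bounds mean -/+ k*std in the original data units."""
    return mean - k * std, mean + k * std


# Hypothetical scores, not from the notes.
scores = [480, 515, 790, 430, 210, 560]
mean, std = 500, 90

# Method 1: flag values whose z-score exceeds 3 in absolute value.
by_z = [x for x, z in zip(scores, z_scores(scores, mean, std)) if abs(z) > 3]

# Method 2: flag values outside the bounds in original units.
lower, upper = outlier_bounds(mean, std)
by_bounds = [x for x in scores if x < lower or x > upper]

print(by_z, by_bounds, (lower, upper))
```

The two methods flag the same values; they differ only in whether the comparison happens in z-units or in the original units.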

Chebyshev's Rule: The General Case

  • Chebyshev's Rule is named after the mathematician who developed it and is a generalized rule that holds regardless of the data distribution shape.

  • Unlike the Empirical Rule, it does not require a symmetric bell-shaped distribution.

  • General Formula: The proportion of values falling within $k$ standard deviations of the mean is at least $1 - \frac{1}{k^2}$.

  • Constraints: The rule is only informative for $k > 1$.
      * If $k = 1$, then $1 - \frac{1}{1^2} = 0$, so the rule guarantees only "at least 0%" and cannot describe what happens within one standard deviation.

  • Trade-offs: This rule offers more generality but loses the specificity of the Empirical Rule.

  • Calculations for Common k-values:
      * For $k = 2$: $1 - \frac{1}{2^2} = 0.75$. At least 75% of values will fall within two standard deviations of the mean.
      * For $k = 3$: $1 - \frac{1}{3^2} \approx 0.8889$. At least 88.89% of values will fall within three standard deviations of the mean.

  • Interpretive Note: The rule specifies "at least," meaning it provides a minimum floor. A distribution could contain more than the calculated percentage, but the rule covers all distribution shapes and therefore cannot be more precise.
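
A minimal Python sketch of the bound, assuming a hypothetical helper named `chebyshev_bound`. It rejects $k \le 1$, where the formula gives no useful guarantee.

```python
def chebyshev_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's Rule is only informative for k > 1")
    return 1 - 1 / k**2


for k in (2, 3):
    print(f"k={k}: at least {chebyshev_bound(k):.2%} of values")
```

For comparison, the Empirical Rule states about 95% and 99.7% at the same two values of $k$, which shows the specificity traded away for generality.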

Chebyshev's Rule Worked Example

  • Scenario: Given a mean of 50 and a standard deviation of 5, find the bounds of an interval guaranteed to contain at least 95% of the data when symmetry is not guaranteed.

  • Probability Statement: Because the rule is "at least," we use an inequality:
      * $P(\mu - k\sigma < x < \mu + k\sigma) \geq 1 - \frac{1}{k^2}$

  • Step 1: Solve for k:
      * $1 - \frac{1}{k^2} = 0.95$
      * $-\frac{1}{k^2} = -0.05$
      * $k^2 = \frac{1}{0.05} = 20$
      * $k = \sqrt{20} \approx 4.4721$
      * (Using $k = 4.47$ for calculation.)

  • Step 2: Calculate the Bounds:
      * Lower Bound: $50 - (4.47 \times 5) = 50 - 22.35 = 27.65$
      * Upper Bound: $50 + (4.47 \times 5) = 50 + 22.35 = 72.35$

  • Final Statement: $P(27.65 < x < 72.35) \geq 0.95$.
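
The two steps of the worked example can be sketched in Python. The function names `chebyshev_k` and `chebyshev_interval` are hypothetical, and the sketch keeps $k$ unrounded instead of using 4.47.

```python
import math


def chebyshev_k(coverage):
    """Solve 1 - 1/k^2 = coverage for k (requires 0 < coverage < 1)."""
    if not 0 < coverage < 1:
        raise ValueError("coverage must be strictly between 0 and 1")
    return math.sqrt(1 / (1 - coverage))


def chebyshev_interval(mean, std, coverage):
    """Bounds mean -/+ k*std guaranteed to hold at least `coverage` of the data."""
    k = chebyshev_k(coverage)
    return mean - k * std, mean + k * std


k = chebyshev_k(0.95)
lower, upper = chebyshev_interval(mean=50, std=5, coverage=0.95)
print(f"k = {k:.4f}, interval = ({lower:.2f}, {upper:.2f})")
```

Because $k$ is not rounded down here, the bounds come out slightly wider (about 27.64 and 72.36) than the hand calculation with $k = 4.47$.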

Relationships Between Two Numerical Variables

  • Scatter Plots: A visual tool where coordinate pairs of two numeric variables ($x$ and $y$) are plotted to identify trends (e.g., positive linear trends).

  • Covariance: A measure of the linear relationship between two numeric variables, specifically how they vary together.
      * Sample Covariance Notation: $\text{Cov}(x, y)$ or $s_{xy}$ (the symbol $\sigma_{xy}$ conventionally denotes the population covariance).
      * Theoretical Formula: $s_{xy} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
      * Computational Formula: $s_{xy} = \frac{\sum x_i y_i - \frac{1}{n} (\sum x_i)(\sum y_i)}{n - 1}$

  • Interpreting the Sign of Covariance:
      * Positive (Cov > 0): Positive linear relationship; $x$ and $y$ tend to move in the same direction.
      * Negative (Cov < 0): Negative linear relationship; $x$ and $y$ tend to move in opposite directions.
      * Zero (Cov = 0): No linear relationship. This alone does not prove the variables are independent, because a nonlinear relationship can still have zero covariance. Graphically, this could be a horizontal line ($y$ does not change with $x$) or randomly scattered points with no discernible shape.

  • Flaw of Covariance: It is difficult to interpret because it is not bounded. The magnitude depends on the units and scales of the data, making it impossible to determine the "strength" of a relationship from covariance alone.
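
The sketch below implements both sample covariance formulas in Python and illustrates the scale problem. The data values are hypothetical, chosen so the arithmetic is easy to check by hand, and the function names are illustrative.

```python
def cov_definitional(xs, ys):
    """Sample covariance from deviations about the means (divides by n - 1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def cov_computational(xs, ys):
    """Sample covariance from raw sums: (sum xy - (sum x)(sum y)/n) / (n - 1)."""
    n = len(xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return (sum_xy - sum(xs) * sum(ys) / n) / (n - 1)


# Hypothetical data, not from the notes.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

print(cov_definitional(xs, ys), cov_computational(xs, ys))

# Changing the units of x (e.g. metres to centimetres) rescales the covariance,
# even though the relationship between the variables is unchanged.
xs_rescaled = [100 * x for x in xs]
print(cov_definitional(xs_rescaled, ys))
```

Both formulas agree on the same data, and multiplying every $x$ by 100 multiplies the covariance by 100, which is why its magnitude alone says little about strength.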

Coefficient of Correlation

  • The Coefficient of Correlation measures both the direction and the strength of a linear relationship.

  • Sample Correlation Coefficient ($r$): $r = \frac{\text{Cov}(x, y)}{s_x \times s_y}$

  • Population Correlation Coefficient: Represented by the Greek symbol $\rho$ (rho).

  • Properties:
      * Bounded between $-1$ and $1$.
      * $-1$: Perfect negative correlation.
      * $0$: No linear correlation.
      * $1$: Perfect positive correlation.

  • Correlation Strength Examples:
      * $-0.8$: Strong negative correlation.
      * $-0.3$: Moderate negative correlation.
      * $0.7$: Strong positive correlation.
      * $0.4$: Moderate positive correlation.

  • Real-world Visualization:
      * Perfect Relationship: Points form a straight line. There is a fixed way $y$ changes with $x$.
      * Strong (Not Perfect) Relationship: There is a discernible line through the data, but with some "spread" caused by other factors or randomness.
      * Moderate Relationship: A line is still discernible, but with a significant spread of points away from the line.
      * Zero Correlation: $x$ changes value, but $y$ shows no linear trend in response.
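
A minimal Python sketch of the sample correlation coefficient, computed as the sample covariance divided by the product of the sample standard deviations. The data are hypothetical and the helper names are illustrative.

```python
import math


def sample_cov(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def sample_std(values):
    """Sample standard deviation: square root of the covariance of a variable with itself."""
    return math.sqrt(sample_cov(values, values))


def correlation(xs, ys):
    """Sample correlation coefficient r = Cov(x, y) / (s_x * s_y)."""
    return sample_cov(xs, ys) / (sample_std(xs) * sample_std(ys))


# Hypothetical data, not from the notes.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
r_rescaled = correlation([100 * x for x in xs], ys)   # change the units of x
r_perfect = correlation(xs, [2 * x + 1 for x in xs])  # exact straight line
print(round(r, 4), round(r_rescaled, 4), round(r_perfect, 4))
```

Unlike covariance, $r$ is unchanged when $x$ is rescaled, and it reaches 1 only when the points lie exactly on an upward-sloping line.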

Correlation versus Causation

  • No causal effect is implied by correlation: "Correlation does not equal causation."

  • Ice Cream and Shark Attack Example: A study might show high correlation between ice cream sales at the beach and shark attacks. This does not mean eating ice cream causes attacks. Instead, both are driven by a hidden (confounding) variable: the number of people present at the beach.
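
The hidden-variable effect can be illustrated with a small Python sketch. All numbers below are made up for illustration and do not come from any study: ice cream sales and shark attacks are each constructed to rise with the number of beach visitors, with no direct link between them.

```python
import math


def correlation(xs, ys):
    """Sample correlation coefficient r (the n - 1 factors cancel out)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical daily figures: both outcomes depend on visitors, not on each other.
visitors = [100, 200, 300, 400, 500, 600]
ice_cream_sales = [32, 58, 95, 118, 152, 177]
shark_attacks = [0, 1, 1, 2, 2, 3]

r_ice_shark = correlation(ice_cream_sales, shark_attacks)
r_visitors_ice = correlation(visitors, ice_cream_sales)
r_visitors_shark = correlation(visitors, shark_attacks)
print(round(r_ice_shark, 2), round(r_visitors_ice, 2), round(r_visitors_shark, 2))
```

Ice cream sales and shark attacks come out strongly correlated even though neither was built from the other; the shared dependence on visitor count produces the association.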

Professional Ethics and Objective Analysis

  • Data analysis must remain objective and neutral.

  • Proper Reporting: Summary measures must communicate important aspects accurately. For example, if data is skewed, reporting only the mean is misleading; both the mean and median should be reported to highlight the skewness.

  • Avoiding Subjectivity: Avoid emotional language in reporting (e.g., use "sales are trending down" instead of calling it a "disastrous crisis").

  • Ethical Constraints:     * Report both good and bad results.     * Do not handpick data or summary measures to distort facts.     * Remain fair, objective, and neutral.

Questions & Discussion

  • Q: Why did you use a probability statement to explain frequency?

  • A: In this module, the frequency of occurrence is related explicitly to the probability of occurrence. If 95% of data falls within an interval, there is a 95% chance (probability) that a value will be contained in that interval.

  • Q: Regarding the Chebyshev bound calculation, why multiply $k$ by the standard deviation?

  • A: The rule specifies "within $k$ standard deviations of the mean," which is represented mathematically as $k \times \sigma$. For the example with $\sigma = 5$ and $k = 4.47$, the term becomes $4.47 \times 5$.