SDS CH3 - Empirical Rule, Chebyshev's Theorem, and Bivariate Data Relationships

The Empirical Rule: Review and Specific Conditions

The Empirical Rule gives the approximate percentage of data that falls within one, two, and three standard deviations of the mean for bell-shaped distributions: * About $68\%$ of the data falls within one standard deviation of the mean. * About $95\%$ of the data falls within two standard deviations of the mean. * About $99.7\%$ of the data falls within three standard deviations of the mean.
This rule is not universal and cannot be applied to any arbitrary dataset.
Specific Condition: The data must follow a symmetric and bell-shaped distribution.
The bell shape appears in many fields, but the rule's reach is limited: many distributions are naturally asymmetric (skewed) and call for a different approach.

Practical Application: SAT Scores Example

The variables and parameters for the SAT score example are: * Variable $x = \text{SAT scores}$ . * Mean $\mu = 500$ . * Standard Deviation $\sigma = 90$ . * Distribution: Bell-shaped (allowing the use of the Empirical Rule).
One Standard Deviation Interval: * Calculation: $500 \pm 90$ * Interval: $(410, 590)$ * Data Content: About $68\%$ of test takers score between $410$ and $590$ .
Two Standard Deviation Interval: * Calculation: $500 \pm 180$ * Interval: $(320, 680)$ * Data Content: About $95\%$ of test takers score between $320$ and $680$ .
Three Standard Deviation Interval: * Calculation: $500 \pm 270$ * Interval: $(230, 770)$ * Data Content: About $99.7\%$ of test takers score between $230$ and $770$ .
Interpretation: A significant portion of the data is captured within one standard deviation. The three-standard-deviation range contains nearly all of the data.
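
The three intervals above can be reproduced with a short Python sketch. The function name `empirical_intervals` is a hypothetical helper introduced here for illustration, and the shares are the approximate Empirical Rule percentages, valid only under the bell-shape assumption.

```python
def empirical_intervals(mean, sd):
    """Return {k: (lower, upper, approx_share)} for k = 1, 2, 3 standard deviations.

    The shares are the approximate Empirical Rule percentages and assume
    a symmetric, bell-shaped distribution.
    """
    shares = {1: 0.68, 2: 0.95, 3: 0.997}
    return {k: (mean - k * sd, mean + k * sd, share) for k, share in shares.items()}


# SAT example from the notes: mean 500, standard deviation 90.
sat = empirical_intervals(500, 90)
for k, (lower, upper, share) in sat.items():
    print(f"{k} SD: ({lower}, {upper}) holds about {share:.1%} of scores")
```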

Identification of Outliers

Data points that fall more than three standard deviations from the mean are unusual enough to be flagged as outliers.
Methods of Identification: * Z-Scores: Flagging values whose z-score, $z = \frac{x - \mu}{\sigma}$ , is below $-3$ or above $3$ . * Original Data Bounds: Calculating the numerical bounds $\mu - 3\sigma$ and $\mu + 3\sigma$ so that outliers can be identified in the original data units rather than in transformed z-units.
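
Both identification methods can be sketched in Python as follows. The `scores` list is made-up illustrative data, not from the notes, and `three_sd_outliers` is a hypothetical helper; the sketch uses the sample mean and sample standard deviation in place of $\mu$ and $\sigma$.

```python
import statistics


def three_sd_outliers(values):
    """Flag values more than three standard deviations from the mean, two ways."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)

    # Method 1: z-scores, flag |z| > 3.
    by_z = [v for v in values if abs((v - mean) / sd) > 3]

    # Method 2: bounds in the original data units.
    lower, upper = mean - 3 * sd, mean + 3 * sd
    by_bounds = [v for v in values if v < lower or v > upper]

    return by_z, by_bounds, (lower, upper)


# Hypothetical scores: 19 values near 500 plus one extreme value.
scores = [480, 490, 495, 500, 505, 510, 520, 485, 515, 500,
          498, 502, 493, 507, 499, 501, 496, 504, 500, 900]
by_z, by_bounds, bounds = three_sd_outliers(scores)
print("Flagged by z-score:", by_z)
print("Flagged by bounds: ", by_bounds, "using bounds", bounds)
```

The two methods flag the same points; the bounds version is simply easier to report in the original units.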

Chebyshev's Rule: The General Case

Chebyshev's Rule, named after the Russian mathematician Pafnuty Chebyshev, is a general rule that holds regardless of the shape of the data distribution.
Unlike the Empirical Rule, it does not require a symmetric bell-shaped distribution.
General Formula: The proportion of values falling within $k$ standard deviations of the mean is at least $1 - \frac{1}{k^2}$ .
Constraints: The rule is informative only for $k > 1$ . * If $k = 1$ , then $1 - \frac{1}{1^2} = 0$ , so the rule guarantees only that at least $0\%$ of the data lies within one standard deviation, which says nothing useful.
Trade-offs: This rule offers more generality but loses the specificity of the Empirical Rule.
Calculations for Common k-values: * For $k = 2$ : $1 - \frac{1}{2^2} = 0.75$ . At least $75\%$ of values will fall within two standard deviations of the mean. * For $k = 3$ : $1 - \frac{1}{3^2} = 0.8889$ . At least $88.89\%$ of values will fall within three standard deviations of the mean.
Interpretive Note: The rule specifies "at least," meaning it provides a minimum floor. A distribution could contain more than the calculated percentage, but the rule covers all distribution shapes and therefore cannot be more precise.
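
The formula is easy to check numerically. Below is a minimal Python sketch; the function name `chebyshev_min_share` is an assumption made for illustration.

```python
def chebyshev_min_share(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        # For k <= 1 the bound is 0 or negative, so it carries no information.
        raise ValueError("Chebyshev's Rule is informative only for k > 1")
    return 1 - 1 / k**2


for k in (2, 3):
    print(f"k = {k}: at least {chebyshev_min_share(k):.2%} of values")
```

Compared with the Empirical Rule's roughly $95\%$ and $99.7\%$ for the same $k$ , these floors are lower, which is the price of making no assumption about shape.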

Chebyshev's Rule Worked Example

Scenario: Given a mean of $50$ and a standard deviation of $5$ , find the bounds of an interval that contains at least $95\%$ of the data when symmetry is not guaranteed.
Probability Statement: Because the rule is "at least," we use an inequality: * $P(\mu - k\sigma < x < \mu + k\sigma) \geq 1 - \frac{1}{k^2}$
Step 1: Solve for k: * $1 - \frac{1}{k^2} = 0.95$ * $- \frac{1}{k^2} = -0.05$ * $k^2 = \frac{1}{0.05} = 20$ * $k = \sqrt{20} \approx 4.4721$ * (Using $k = 4.47$ for calculation).
Step 2: Calculate the Bounds: * Lower Bound: $50 - (4.47 \times 5) = 50 - 22.35 = 27.65$ * Upper Bound: $50 + (4.47 \times 5) = 50 + 22.35 = 72.35$
Final Statement: $P(27.65 < x < 72.35) \geq 0.95$ .
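
The two steps of the worked example can be sketched in Python. `chebyshev_interval` is a hypothetical helper name, and the sketch keeps $k$ unrounded rather than truncating it to $4.47$ .

```python
import math


def chebyshev_interval(mean, sd, min_share):
    """Solve 1 - 1/k^2 = min_share for k, then return (k, lower, upper)."""
    if not 0 < min_share < 1:
        raise ValueError("min_share must be strictly between 0 and 1")
    k = math.sqrt(1 / (1 - min_share))  # Step 1: solve for k
    return k, mean - k * sd, mean + k * sd  # Step 2: bounds


k, lower, upper = chebyshev_interval(50, 5, 0.95)
print(f"k = {k:.4f}, interval = ({lower:.2f}, {upper:.2f})")
```

With unrounded $k$ the bounds come out near $27.64$ and $72.36$ , marginally wider than the $27.65$ and $72.35$ obtained from $k = 4.47$ . Rounding $k$ down shrinks the interval slightly, so rounding up is the safer choice when the $95\%$ floor must hold exactly.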

Relationships Between Two Numerical Variables

Scatter Plots: A visual tool where coordinate pairs of two numeric variables ( $x$ and $y$ ) are plotted to identify trends (e.g., positive linear trends).
Covariance: A measure of the linear relationship between two numeric variables, specifically how they vary together. * Sample Covariance Notation: $\text{Cov}(x, y)$ or $s_{xy}$ (the symbol $\sigma_{xy}$ denotes the population covariance). * Definitional Formula: $s_{xy} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$ * Computational Formula: $s_{xy} = \frac{\sum x_iy_i - \frac{1}{n} (\sum x_i)(\sum y_i)}{n - 1}$
Interpreting the Sign of Covariance: * Positive (Cov > 0): Positive linear relationship; $x$ and $y$ tend to move in the same direction. * Negative (Cov < 0): Negative linear relationship; $x$ and $y$ tend to move in opposite directions. * Zero (Cov = 0): No linear relationship. This does not by itself mean the variables are independent, since a nonlinear (e.g., curved) relationship can also produce zero covariance. Graphically, this could be a horizontal line ( $y$ does not change with $x$ ) or randomly scattered points with no discernible shape.
Flaw of Covariance: It is difficult to interpret because it is not bounded. The magnitude depends on the units and scales of the data, making it impossible to determine the "strength" of a relationship from covariance alone.
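
As a check that the two covariance formulas agree, and to illustrate the unit-dependence flaw, here is a minimal Python sketch. The data values are made up for illustration, and the function names are hypothetical.

```python
def cov_definitional(xs, ys):
    """Sample covariance from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def cov_computational(xs, ys):
    """Sample covariance from raw sums (the shortcut formula)."""
    n = len(xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return (sum_xy - sum(xs) * sum(ys) / n) / (n - 1)


# Hypothetical data, not from the notes.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(cov_definitional(xs, ys), cov_computational(xs, ys))

# Rescaling x (e.g., metres to centimetres) rescales the covariance too,
# even though the underlying relationship has not changed.
xs_cm = [100 * x for x in xs]
print(cov_definitional(xs_cm, ys))
```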

Coefficient of Correlation

The Coefficient of Correlation measures both the direction and the strength of a linear relationship.
Sample Correlation coefficient ( $r$ ): $r = \frac{\text{Cov}(x, y)}{s_x \times s_y}$
Population Correlation coefficient: Represented by the Greek symbol $\rho$ (rho).
Properties: * Bounded between $-1$ and $1$ . * $-1$ : Perfect negative correlation. * $0$ : No linear correlation. * $1$ : Perfect positive correlation.
Correlation Strength Examples: * $-0.8$ : Strong negative correlation. * $-0.3$ : Moderate negative correlation. * $0.7$ : Strong positive correlation. * $0.4$ : Moderate positive correlation.
Real-world Visualization: * Perfect Relationship: Points form a straight line; $y$ changes with $x$ in a fixed, exactly linear way. * Strong (Not Perfect) Relationship: There is a discernible line through the data, but with some "spread" caused by other factors or randomness. * Moderate Relationship: A line is still discernible, but with a significant spread of points away from the line. * Zero Correlation: $y$ shows no linear tendency to rise or fall as $x$ changes.
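
The formula for $r$ and its independence from units can be sketched in Python. The data are made up for illustration and `sample_correlation` is a hypothetical helper built on the standard-library `statistics` module.

```python
import statistics


def sample_correlation(xs, ys):
    """Sample correlation r = Cov(x, y) / (s_x * s_y)."""
    n = len(xs)
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    return cov / (statistics.stdev(xs) * statistics.stdev(ys))


# Hypothetical data, not from the notes.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = sample_correlation(xs, ys)
print(f"r = {r:.4f}")

# Changing the units of x leaves r unchanged, unlike covariance.
r_rescaled = sample_correlation([100 * x for x in xs], ys)

# Points lying exactly on a rising straight line give r = 1.
r_perfect = sample_correlation(xs, [2 * x + 1 for x in xs])
print(f"rescaled r = {r_rescaled:.4f}, perfect-line r = {r_perfect:.4f}")
```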

Correlation versus Causation

No causal effect is implied by correlation: "Correlation does not equal causation."
Ice Cream and Shark Attack Example: A study might show a high correlation between ice cream sales at the beach and shark attacks. This does not mean eating ice cream causes attacks. Instead, both are driven by a hidden (lurking) variable: the number of people present at the beach.

Professional Ethics and Objective Analysis

Data analysis must remain objective and neutral.
Proper Reporting: Summary measures must communicate the important aspects of the data accurately. For example, if the data are skewed, reporting only the mean is misleading; both the mean and the median should be reported so that the skewness is visible.
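
A small Python sketch shows why reporting both measures matters. The income figures are made up for illustration (in thousands) and include one very large value to create right skew.

```python
import statistics

# Hypothetical incomes in thousands; the last value skews the data to the right.
incomes = [30, 32, 35, 36, 38, 40, 250]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)
print(f"mean = {mean_income:.1f}, median = {median_income}")
```

Here the mean sits well above the median because a single large value pulls it upward, so reporting the mean alone would overstate the typical income.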
Avoiding Subjectivity: Avoid emotional language in reporting (e.g., use "sales are trending down" instead of calling it a "disastrous crisis").
Ethical Constraints: * Report both good and bad results. * Do not handpick data or summary measures to distort facts. * Remain fair, objective, and neutral.

Questions & Discussion

Q: Why did you use a probability statement to explain frequency?
A: In this module, the frequency of occurrence is tied explicitly to the probability of occurrence. If $95\%$ of the data falls within an interval, there is a $95\%$ chance (probability) that a randomly selected value lies in that interval.
Q: Regarding the Chebyshev bound calculation, why multiply $k$ by standard deviation?
A: The rule specifies "within $k$ standard deviations of the mean," which is represented mathematically as $k \times \sigma$ . For the example with $\sigma = 5$ and $k = 4.47$ , the term becomes $4.47 \times 5$ .