Chapter 4: Reasoning with Data

CSDS 313: INTRODUCTION TO DATA ANALYSIS

1. What Can Go Wrong in Data Analysis?

  • Data Collection Control: Analysts often have limited control over how data are collected, which opens the door to several problems.

    • Observational Inference: Conclusions drawn from observational (non-experimental) data can mislead.

    • Association vs. Cause and Effect: Distinction between correlation and causation is vital.

    • Bias: Systematic biases can skew results intentionally or unintentionally.

    • Confounders: Overlooking variables that influence results can lead to incorrect conclusions.

  • Data Torture: The idea that manipulating data can make it appear to support any hypothesis.

    • Multiple Hypothesis Testing: Running many tests on the same data increases the risk of finding a "significant" result purely by chance.

    • Overfitting: Tailoring a model too closely to a particular data set may reduce generalizability.

  • Visualization Issues: Visual representations of data can mislead the interpretation.
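
To make the multiple-testing risk listed above concrete: if each test has false-positive rate alpha and the tests are independent (an assumption made here purely for illustration), the chance of at least one spurious "significant" result among m tests is 1 - (1 - alpha)^m. A minimal Python sketch; the function name is hypothetical.

```python
def prob_any_false_positive(alpha: float, m: int) -> float:
    """Chance of at least one false positive among m independent tests."""
    return 1.0 - (1.0 - alpha) ** m

# Assumed significance level of 0.05 for every test.
for m in (1, 5, 20, 100):
    print(f"{m:>3} tests -> {prob_any_false_positive(0.05, m):.3f}")
```

With 20 independent tests at alpha = 0.05, the chance of at least one false positive is already about 64%.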

2. Our Objective

  • Recognizing Traps: Identify statistical traps, misleading visualizations, and patterns resulting from bias.

    • Effort to Debunk: Challenging misleading data interpretations requires significantly more effort than making them.

3. Simpson’s Paradox

  • Definition: A situation where a trend appears in different groups of data but reverses when the groups are combined.

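    • Example: The Berkeley admissions case, where an apparent gender bias in the aggregate data largely disappeared once admissions were examined department by department.

    • Exercise vs. Disease Correlation: The correlation between exercise and disease can differ depending on whether age groups are analyzed separately or pooled together.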

4. Berkeley Admissions Case (1970s)

  • Details: UC Berkeley faced allegations of gender discrimination in graduate admissions.

    • Statistics: 44% of male applicants were admitted, compared to 35% of female applicants.

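    • Stratified Analysis: Broken down by department, the data showed that women applied disproportionately to competitive departments with low admission rates. Within departments the gap largely vanished, and there was a small bias in favor of women.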

5. Other Cases for Analysis

  • Examples: Other cases need the same scrutiny, such as admission rates for international vs. domestic applicants, where geography may introduce bias.

6. Kidney Stone Example (Simpson’s Paradox)

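  • Analysis: Treatment A has the higher success rate for both small and large stones, yet Treatment B appears more effective overall. The reversal occurs because Treatment B was used mostly on small stones, which are easier to treat, while Treatment A was used mostly on large ones.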
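
The reversal can be checked with a few lines of Python. The counts below are the figures commonly quoted for this example; they do not appear in these notes and should be treated as an assumption. The helper name `success_rate` is hypothetical.

```python
# (successes, patients) by treatment and stone size.
# Assumption: commonly quoted counts for this example, not taken from the notes.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def success_rate(treatment, size=None):
    """Success rate within one stone size, or pooled over sizes if size is None."""
    groups = counts[treatment]
    cells = [groups[size]] if size else list(groups.values())
    successes = sum(s for s, _ in cells)
    patients = sum(n for _, n in cells)
    return successes / patients

for size in ("small", "large", None):
    label = size or "overall"
    print(f"{label:>7}: A={success_rate('A', size):.2f}  B={success_rate('B', size):.2f}")
```

Under these counts, A wins within each stone size but B wins once the sizes are pooled, because the two treatments were applied to very different mixes of cases.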

7. Correlation Studies

  • Any study of the correlation between exercise and disease probability should stratify by age, since age can influence both.

8. Will Rogers Phenomenon

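  • Explanation: Moving an element from one group to another can raise the average of both groups, even though no individual value has changed.

    • Medical Relevance: Stage migration, where patients are reclassified from one disease stage to another, can improve the apparent outcomes of both stages and lead to incorrect conclusions about treatment efficacy.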
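
A small numeric sketch of the effect, using made-up survival times in months (all values are hypothetical). Reclassifying the worst early-stage patient as late-stage raises both group means, while the overall mean stays the same.

```python
from statistics import mean

# Hypothetical survival times in months.
early_before = [20, 50, 60, 70]
late_before = [5, 10, 15]

# Reclassify the early-stage patient with the shortest survival as late-stage.
migrated = min(early_before)
early_after = [m for m in early_before if m != migrated]
late_after = late_before + [migrated]

print("early:", mean(early_before), "->", mean(early_after))
print("late: ", mean(late_before), "->", mean(late_after))
print("overall:", mean(early_before + late_before), "->", mean(early_after + late_after))
```

No patient lived longer, yet both stages appear to have improved.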

9. Base Rate Fallacy

  • Concept: Ignoring the general prevalence of a condition (the base rate) and focusing only on case-specific evidence leads to wrong probability estimates.

    • Disease Example: The same test applied to a high-incidence population and to a low-incidence population gives very different probabilities that a positive result is correct.

  • Application: In drunk-driving tests, when few drivers are actually drunk, the probability that a driver who tests positive is drunk is far lower than the test result alone suggests.
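
The effect follows from Bayes' rule. The sketch below uses hypothetical numbers: a test that never misses a drunk driver, a 5% false-positive rate, and two assumed base rates. The function name and all rates are illustrative assumptions, not figures from the notes.

```python
def prob_true_given_positive(base_rate, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' rule."""
    true_pos = base_rate * sensitivity
    false_pos = (1.0 - base_rate) * false_positive_rate
    return true_pos / (true_pos + false_pos)

# Hypothetical: 1 in 1000 drivers is drunk vs. 300 in 1000.
low_incidence = prob_true_given_positive(0.001, 1.0, 0.05)
high_incidence = prob_true_given_positive(0.30, 1.0, 0.05)

print(f"low base rate:  {low_incidence:.3f}")
print(f"high base rate: {high_incidence:.3f}")
```

With the low base rate, a positive result means the driver is drunk only about 2% of the time. With the high base rate, the same test is right about 90% of the time.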

10. Survivorship Bias

  • Definition: Analyzing only the cases that passed some selection process (the "survivors") while ignoring those that did not leads to overly optimistic conclusions about success.

    • Examples: Real-world implications in economics, academia, and healthcare.
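
As an illustration from the economics setting, consider a made-up set of investment funds where the worst performers are closed and drop out of the records. Both the returns and the closure rule below are hypothetical.

```python
from statistics import mean

# Hypothetical annual returns (%) for ten funds launched in the same year.
all_funds = [12, 9, 7, 4, 1, -2, -5, -8, -11, -15]

# Assumed rule: funds returning -5% or worse were closed and vanish from the data.
survivors = [r for r in all_funds if r > -5]

print(f"mean of all funds:      {mean(all_funds):.2f}%")
print(f"mean of survivors only: {mean(survivors):.2f}%")
```

An analyst who sees only the surviving funds would report a clearly positive average return, even though the full cohort lost money on average.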

11. Data Visualization Concerns

  • Guidelines: Specific guidelines exist for visualizing data correctly to avoid misleading representations.

    • Axes and Reference Points: Important for accurately depicting trends, especially in categorical vs. quantitative data.

12. Misleading Axes

  • Graphs often omit zero or use inconsistent scales, which distorts comparisons.
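
A quick calculation shows how much a truncated axis can exaggerate a difference in a bar chart. The values and the helper name `bar_height_ratio` are hypothetical; the sketch compares the ratio of drawn bar heights with and without a zero baseline.

```python
def bar_height_ratio(a, b, baseline=0.0):
    """Ratio of the drawn heights of bars b and a when the axis starts at baseline."""
    return (b - baseline) / (a - baseline)

# Hypothetical values: 50 vs. 52, a 4% difference.
honest = bar_height_ratio(50, 52)                  # axis starts at 0
truncated = bar_height_ratio(50, 52, baseline=49)  # axis starts at 49

print(f"zero baseline: second bar is {honest:.2f}x the first")
print(f"axis from 49:  second bar is {truncated:.2f}x the first")
```

The data differ by 4%, but with the axis starting at 49 the second bar is drawn three times as tall as the first.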

13. Multiple Y Axes

  • Claim: Correlation analysis can be manipulated using various scales to suggest misleading claims.

  • If you change the scales of the two Y axes, you can draw any conclusion you want.

    • Solution: Side-by-side charts, each with its own axis

    • Solution: Indexed charts, which rescale each series relative to a common starting value

    • Solution: Connected scatter plots
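
An indexed chart replaces the two Y axes with one: each series is rescaled so that its first value equals a common base (100 here), and both are plotted on the same scale. A minimal sketch of the rescaling step, with hypothetical data and a hypothetical helper name:

```python
def to_index(series, base=100.0):
    """Rescale a series so that its first value equals base."""
    first = series[0]
    return [base * value / first for value in series]

# Hypothetical series measured in very different units.
revenue_usd_k = [200, 220, 260]
users_millions = [1.0, 1.5, 2.5]

print(to_index(revenue_usd_k))
print(to_index(users_millions))
```

Once indexed, both series express growth relative to the first period, so they can share a single axis and the scale can no longer be tuned to suggest a relationship.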

14. Granularity-Related Inconsistency of Means (GRIM) Test

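  • The GRIM test checks whether a reported mean is mathematically possible given the sample size and the granularity of the data (for example, integer-valued survey responses). An impossible mean signals a data or reporting error.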
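
The idea behind GRIM is that a mean of n integer values must equal some whole-number total divided by n. The sketch below is one simple way to check this, not a reference implementation: it accepts a reported mean if any nearby integer total, divided by n, lies within half a unit of the last reported decimal place. The function name and the tolerance handling are assumptions.

```python
import math

def grim_consistent(reported_mean, n, decimals=2):
    """True if some integer total over n items rounds to the reported mean."""
    half_unit = 0.5 * 10 ** (-decimals)
    approx_total = reported_mean * n
    # Try the integer totals around reported_mean * n.
    for total in range(math.floor(approx_total) - 1, math.ceil(approx_total) + 2):
        if abs(total / n - reported_mean) <= half_unit + 1e-12:
            return True
    return False

# Hypothetical reports: integer-valued responses from 28 participants.
print(grim_consistent(5.18, 28))  # 145 / 28 = 5.1786, rounds to 5.18
print(grim_consistent(5.19, 28))  # no integer total gives 5.19
```

The check only applies when the underlying data are integers (or have some other known granularity) and the sample size is small relative to the reported precision.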

15. Conclusion

  • Skills Enhancement: Developing a keen eye for statistical traps, biases, and the importance of accurate visualizations is crucial for data analysis credibility.
