Statistical Reasoning Lecture 1
Introduction
Presenter: John McGready, Ph.D.
Institution: Johns Hopkins University, Bloomberg School of Public Health
Need for Biostatistics
Importance of biostatistics highlighted by several prominent figures:
Hal Varian, Chief Economist at Google, stated in 2009:
> “I keep saying that the sexy job in the next 10 years will be statisticians.”
Harvard Business Review (2012) dubbed Data Scientist the “Sexiest Job of the 21st Century.”
New York Times (2009) emphasized statistics as a crucial skill for graduates.
Employment Trends in Statistics
Forbes (2019) reported that Glassdoor ranked Data Scientist as the top job in America.
Money magazine listed the top 100 jobs in 2021, emphasizing strong demand for statistics-related roles.
Steps in a Research Project
Major steps include:
Planning/Design of Study
Data Collection
Data Analysis
Presentation
Interpretation
Statistics play a role in multiple steps, often concentrated in the data analysis phase.
The Ubiquity of Data
Data sources:
Elmo and Apple study: Children picked twice as many apples when the apples carried Elmo stickers (Cornell University study on children’s preferences).
STD testing in DC High Schools: Noted that 13% of ~3,000 tested students were positive for STDs, mostly gonorrhea and chlamydia.
Web-based counseling reduces blood pressure: Participants in a web-based lifestyle counseling group had a larger reduction in systolic blood pressure (10 mmHg) compared to control (6 mmHg).
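Vaccine Efficacy: A vaccine with 95% efficacy does not mean that 5% of vaccinated people get the disease. It means the risk of disease among the vaccinated is 95% lower than the risk among the unvaccinated, so the figure must be interpreted with care.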
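To make the vaccine efficacy point concrete, here is a minimal sketch using the standard definition of efficacy as one minus the ratio of disease risk in the vaccinated group to disease risk in the unvaccinated group. The trial counts and the function name `vaccine_efficacy` are hypothetical, chosen only so that efficacy works out to 95%; they do not come from any study mentioned in these notes.

```python
def vaccine_efficacy(cases_vax, n_vax, cases_placebo, n_placebo):
    """Efficacy = 1 - (risk in vaccinated / risk in unvaccinated)."""
    risk_vax = cases_vax / n_vax
    risk_placebo = cases_placebo / n_placebo
    return 1 - risk_vax / risk_placebo

# Hypothetical trial with 10,000 people per arm (illustrative numbers only).
cases_vax, n_vax = 5, 10_000
cases_placebo, n_placebo = 100, 10_000

efficacy = vaccine_efficacy(cases_vax, n_vax, cases_placebo, n_placebo)
risk_vax = cases_vax / n_vax  # absolute risk among the vaccinated

print(f"Efficacy: {efficacy:.0%}")
print(f"Risk of disease among the vaccinated: {risk_vax:.2%}")
```

With these made-up counts the efficacy is 95%, yet only 5 of 10,000 vaccinated participants (0.05%) became ill, which is far from a 5% failure rate.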
Role of Statistics in Research
Components of the research process include:
Planning/Design:
Identify primary questions:
Is it about quantifying a single group?
Is it comparing multiple groups?
Determine sample size:
Total subjects?
Distribution across groups?
Selecting participants:
Random selection versus convenience sampling.
Deciding how to assign participants to groups for comparisons.
Data Collection and Analysis:
Summarization of raw data.
Addressing variability that can obscure patterns.
Inference: Using information from the study sample to make statements about the population.
Presentation and Interpretation:
Choosing which measures convey the main messages effectively.
Clarifying uncertainty and deriving practical meaning from results.
Course Goals
Overview of Skills:
Term 1 Goals:
Summarization
Measurement of Associations
Interval Estimation and Statistical Inference
Sample Size Considerations in Study Design
Term 2 Goals:
Adjustment Techniques
Assessing Effect Modification (Statistical Interactions)
Understanding various regression techniques: linear, logistic, and time-to-event.
Universal Goals in Statistics
Focus on:
Correct interpretation of statistical results.
Summarizing published study results clearly.
Evaluating strengths and weaknesses in published research regarding:
Study design clarity
Research questions
Appropriateness of statistical methods
Clarity of results reported
Overall scientific conclusions.
Defining Populations and Samples
Population: Entire group for which data is sought.
Example: All 18-year-old male college students in the U.S.
Sample: Subset of the population used for data collection.
Example: 25 male college students in the U.S., all aged 18.
Characteristics of a random sample should ideally reflect the overall population, although this alignment is not always achievable.
Random Sampling
The best way to obtain a representative sample, though not always feasible.
Defined as a sampling method in which every possible subset of a given size (n) from the population has an equal chance of being selected.
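The definition of random sampling can be checked empirically. The sketch below assumes a tiny hypothetical population of five labeled units and uses Python's standard `random.sample` to draw samples of size 2; it then counts how often each of the 10 possible subsets appears across many draws. The population, seed, and number of draws are illustrative choices.

```python
import random
from collections import Counter
from itertools import combinations

population = ["A", "B", "C", "D", "E"]  # hypothetical population of 5 units
n = 2                                    # sample size

# Every possible subset of size n from the population.
all_subsets = list(combinations(population, n))

rng = random.Random(42)             # seeded so the sketch is reproducible
sample = rng.sample(population, n)  # one simple random sample

# Empirical check: draw many samples and count how often each subset appears.
draws = 20_000
counts = Counter(
    tuple(sorted(rng.sample(population, n))) for _ in range(draws)
)

print("One sample:", sample)
for subset in all_subsets:
    print(subset, round(counts[subset] / draws, 3))
```

Each subset should appear in roughly one tenth of the draws, which is what "equal chance of selection" means in practice.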
Comparative Analysis: Population Versus Sample
Research focuses on estimating population truths using imperfect sample data.
Examples of Sample vs. Population
Pulmonary health research example: 113 men sampled and blood pressure measured.
Maternal HIV transmission study: Observed 183 births to HIV+ women, 22% transmission rate obtained.
Geographic lung cancer study: Used data from a single year for a selected U.S. state.
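As an illustration of using a sample to say something about a population, the sketch below takes the figures from the maternal HIV transmission example (183 births, 22% transmission) and computes an approximate 95% confidence interval with the normal-approximation (Wald) formula. This is one common method among several and is an assumption here; the notes do not say which method the study itself used.

```python
import math

n = 183        # births observed (from the maternal HIV example)
p_hat = 0.22   # observed transmission proportion in the sample

# Standard error of a sample proportion under the normal approximation.
se = math.sqrt(p_hat * (1 - p_hat) / n)

z = 1.96  # approximate 97.5th percentile of the standard normal distribution
lower, upper = p_hat - z * se, p_hat + z * se

print(f"Estimate: {p_hat:.2f}, approximate 95% CI: ({lower:.2f}, {upper:.2f})")
```

The interval expresses the uncertainty that comes from observing 183 births rather than every birth in the population of interest.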
Non-Random Sample Types
Non-random sampling may introduce bias because some parts of the population may not be represented. Random sampling is especially difficult for populations such as:
Potential voters, not all of whom are registered.
Specific at-risk groups, such as intravenous drug users.
Homeless populations.
Implications of Non-Random Sampling
Such sampling may not accurately represent population characteristics, potentially skewing findings and interpretations.
Comparison of Study Designs
Learning objectives cover descriptions and distinctions between randomized cohort, observational cohort, and case-control designs.
Understanding the analytical challenges of unrandomized comparisons.
Common Study Design Types
Randomized Cohort Studies (Randomized Controlled Trials):
Subjects are randomly assigned to exposure groups (for example, treatment or control) and followed to compare outcomes.
Observational Cohort Studies:
Subjects selected based on exposure, followed to see outcomes.
Case-Control Studies:
Subjects are selected based on outcome status (cases and controls), and their prior exposure is then assessed.
Importance of Randomization in Experiments
Randomization minimizes systematic differences between groups other than the exposure itself, which reduces bias when outcomes are compared.
Landmark study: The Salk Polio Vaccine trial involved over 200,000 subjects—results adjusted for accuracy.
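A minimal sketch of random assignment, assuming 20 hypothetical subject IDs split evenly into two arms. Shuffling the list and cutting it in half is one simple way to randomize; real trials often use more elaborate schemes (for example, blocked or stratified randomization), which are not shown here.

```python
import random

subjects = [f"subject_{i:02d}" for i in range(1, 21)]  # 20 hypothetical IDs

rng = random.Random(2024)  # seeded for reproducibility
shuffled = subjects[:]     # copy so the original order is untouched
rng.shuffle(shuffled)

# First half goes to treatment, second half to control.
half = len(shuffled) // 2
treatment = shuffled[:half]
control = shuffled[half:]

print("Treatment:", sorted(treatment))
print("Control:  ", sorted(control))
```

Because chance alone decides who lands in each arm, neither the investigator nor the subject can steer assignment in a way that is related to the outcome.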
Randomization Limitations
Randomization is not always feasible or ethical, particularly when the exposure carries health risks (for example, subjects cannot be randomly assigned to smoke).
Analyzing Observational Studies
Subjects self-select their exposures, which can introduce bias and complicate conclusions.
Example: Smoking and alcohol use are correlated, so an apparent effect of one on an outcome may be partly or wholly due to the other.
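The smoking and alcohol example can be sketched with a small table of hypothetical counts. In the made-up data below, alcohol has no effect on the outcome within either smoking group, yet a crude comparison that ignores smoking makes drinkers look worse off, simply because drinkers are more likely to smoke. All numbers are invented for illustration.

```python
# Hypothetical counts: (cases, total) by smoking status and alcohol use.
# Within each smoking stratum, drinkers and non-drinkers have the SAME risk.
data = {
    "smoker":     {"drinker": (24, 80), "non_drinker": (6, 20)},
    "non_smoker": {"drinker": (2, 20),  "non_drinker": (8, 80)},
}

def risk(cases, total):
    """Proportion of subjects who experienced the outcome."""
    return cases / total

# Crude comparison: pool the smoking strata and compare by alcohol use only.
crude = {}
for group in ("drinker", "non_drinker"):
    cases = sum(data[s][group][0] for s in data)
    total = sum(data[s][group][1] for s in data)
    crude[group] = risk(cases, total)
crude_rr = crude["drinker"] / crude["non_drinker"]

# Stratified comparison: drinkers vs non-drinkers within each smoking group.
stratum_rr = {
    s: risk(*data[s]["drinker"]) / risk(*data[s]["non_drinker"]) for s in data
}

print(f"Crude relative risk (alcohol): {crude_rr:.2f}")
for s, rr in stratum_rr.items():
    print(f"Relative risk among {s}s: {rr:.2f}")
```

Comparing within strata is one simple form of adjustment; regression methods generalize the same idea to several variables at once.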
Example of Observational Cohort Study: Needle Exchange Programs
Estimates the relative risk of HIV infection associated with program participation, adjusting for demographic differences.
Example of HPV Vaccination Study
Gender-based outcomes from vaccination studied, findings adjusted for health-seeking behaviors and demographics.
Case-Control Studies
An efficient alternative to cohort studies for analyzing rare outcomes, such as the association between an exposure and lung cancer.
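Because a case-control study fixes the number of cases and controls in advance, outcome risk cannot be estimated directly from it; the measure conventionally reported is the odds ratio. The notes do not name a measure, so treating the odds ratio as the summary here is an assumption, and the 2x2 counts and the function name `odds_ratio` below are hypothetical.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds of exposure among cases divided by odds of exposure among controls."""
    odds_cases = exposed_cases / unexposed_cases
    odds_controls = exposed_controls / unexposed_controls
    return odds_cases / odds_controls

# Hypothetical study: 100 lung cancer cases and 100 controls (illustrative only).
or_estimate = odds_ratio(
    exposed_cases=60, unexposed_cases=40,
    exposed_controls=30, unexposed_controls=70,
)

print(f"Odds ratio: {or_estimate:.2f}")
```

An odds ratio above 1 indicates that exposure is more common among cases than among controls in the sample.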
Challenges with Case-Control Studies
Confounding and recall bias can undermine reliability, so analyses should adjust for potential sources of distortion.
Summary of Study Type Differences
Non-randomized studies of every type raise issues that must be addressed, with implications for public health research.
Types of Data in Research
Learning objectives focus on categorizing data types effectively.
Continuous Data: Measurements that can take any value within a range, so infinitely many values are possible (e.g., blood pressure).
Binary Data: Takes two values (e.g., yes/no).
Categorical Data: Extends binary to more than two values; categories are either nominal (unordered) or ordinal (ordered).
Time-to-Event Data: Captures both whether an event occurs and the time until it occurs (e.g., time to relapse).
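Time-to-event data is the least familiar of these types, because each subject contributes two pieces of information: a follow-up time and whether the event was actually observed (subjects without an observed event are censored). The sketch below represents such data as pairs and summarizes it with a hand-written Kaplan-Meier estimate, a standard summary for this data type. The observations and the function name `kaplan_meier` are hypothetical.

```python
# Hypothetical follow-up data: (time in months, event observed?)
# True means relapse occurred at that time; False means the subject was censored.
observations = [(2, True), (3, False), (5, True), (5, True), (8, False), (10, True)]

def kaplan_meier(obs):
    """Return [(time, estimated survival probability)] at each distinct event time."""
    event_times = sorted({t for t, event in obs if event})
    survival = 1.0
    curve = []
    for t in event_times:
        # Subjects still under observation just before time t.
        at_risk = sum(1 for time, _ in obs if time >= t)
        # Events that occur exactly at time t.
        events = sum(1 for time, event in obs if time == t and event)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve

km_curve = kaplan_meier(observations)
for t, s in km_curve:
    print(f"t={t:>2} months  S(t)={s:.3f}")
```

Censored subjects count toward the number at risk up to their last follow-up time but never count as events, which is how the estimate uses partial information.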
Data Analysis Considerations
Use appropriate tools for different data types for robust statistical assessments.
Various comparison techniques are employed based on data format:
Continuous: Utilize mean differences and statistical tests like t-tests.
Binary/Categorical: Comparison using proportion differences and chi-squared tests.
Time-to-Event: Incidence rate ratios and Kaplan-Meier curves for survival analysis.
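The comparison measures listed above can be sketched with the standard library alone. The block computes a difference in means with Welch's t statistic, a difference in proportions with a chi-squared statistic for a 2x2 table (no continuity correction), and an incidence rate ratio. All data are hypothetical, and p-values are omitted because they would need a statistics package such as SciPy.

```python
import math
from statistics import mean, variance

# --- Continuous: difference in means and Welch's t statistic ---
group_a = [128, 135, 142, 120, 138, 131]  # hypothetical systolic BP (mmHg)
group_b = [122, 118, 130, 125, 119, 127]
mean_diff = mean(group_a) - mean(group_b)
se_diff = math.sqrt(
    variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
)
t_stat = mean_diff / se_diff

# --- Binary: difference in proportions and chi-squared statistic (2x2) ---
# Hypothetical table: rows = group, columns = (event, no event).
table = [[30, 70], [15, 85]]
p1 = table[0][0] / sum(table[0])
p2 = table[1][0] / sum(table[1])
prop_diff = p1 - p2
total = sum(sum(row) for row in table)
col_totals = [table[0][j] + table[1][j] for j in range(2)]
chi2 = 0.0
for i in range(2):
    for j in range(2):
        # Expected count if group and outcome were unrelated.
        expected = sum(table[i]) * col_totals[j] / total
        chi2 += (table[i][j] - expected) ** 2 / expected

# --- Time-to-event: incidence rate ratio (events per person-year) ---
events_1, person_years_1 = 20, 400.0
events_2, person_years_2 = 10, 500.0
irr = (events_1 / person_years_1) / (events_2 / person_years_2)

print(f"Mean difference: {mean_diff:.2f} mmHg (t = {t_stat:.2f})")
print(f"Difference in proportions: {prop_diff:.2f} (chi-squared = {chi2:.2f})")
print(f"Incidence rate ratio: {irr:.2f}")
```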
Summary of Analyzed Data Types
Three key types in exploration:
Continuous, Binary/Categorical, Time-to-Event
Different methodologies critical for summarization and analysis.