Statistical Reasoning Lecture 1

Introduction

Presenter: John McGready, Ph.D.
Institution: Johns Hopkins University, Bloomberg School of Public Health

Need for Biostatistics

Importance of biostatistics highlighted by several prominent figures:
- Hal Varian, Chief Economist at Google, stated in 2009:
  > “I keep saying that the sexy job in the next 10 years will be statisticians.”
- Harvard Business Review (2012) called data scientist "the sexiest job of the 21st century."
- New York Times (2009) emphasized statistics as a crucial skill for graduates.

Employment Trends in Statistics

Forbes (2019) reported that Glassdoor ranked Data Scientist as the top job in America.
Money magazine listed the top 100 jobs in 2021, emphasizing strong demand for statistics-related roles.

Steps in a Research Project

Major steps include:
- Planning/Design of Study
- Data Collection
- Data Analysis
- Presentation
- Interpretation
Statistics plays a role in several of these steps, though its use is often concentrated in the data analysis phase.

The Ubiquity of Data

Data sources:
- Elmo and Apple study: Children picked twice as many apples when the apples carried Elmo stickers (Cornell University study on children’s preferences).
- STD testing in DC High Schools: Noted that 13% of ~3,000 tested students were positive for STDs, mostly gonorrhea and chlamydia.
- Web-based counseling reduces blood pressure: Participants in a web-based lifestyle counseling group had a larger reduction in systolic blood pressure (10 mmHg) compared to control (6 mmHg).
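- Vaccine Efficacy: A vaccine with 95% efficacy does not mean that 5% of vaccinated people get the disease. Efficacy compares the risk of disease among vaccinated people with the risk among unvaccinated people, so it needs careful statistical interpretation.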
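
To make the vaccine efficacy point concrete, here is a minimal sketch that computes efficacy as one minus the ratio of disease risks (vaccinated versus unvaccinated). The trial counts and the function name `vaccine_efficacy` are hypothetical, chosen only to illustrate the arithmetic; they are not from any study mentioned in the lecture.

```python
def vaccine_efficacy(cases_vaccinated, n_vaccinated, cases_placebo, n_placebo):
    """Efficacy = 1 - (risk among vaccinated / risk among unvaccinated)."""
    risk_vaccinated = cases_vaccinated / n_vaccinated
    risk_placebo = cases_placebo / n_placebo
    return 1 - risk_vaccinated / risk_placebo


# Hypothetical trial: 10,000 people per arm (illustrative numbers only).
ve = vaccine_efficacy(cases_vaccinated=5, n_vaccinated=10_000,
                      cases_placebo=100, n_placebo=10_000)
risk_vaccinated = 5 / 10_000

print(f"Efficacy: {ve:.0%}")
print(f"Risk of disease among vaccinated: {risk_vaccinated:.2%}")
```

With these made-up counts the efficacy is 95%, yet only 0.05% of vaccinated participants became ill, not 5%. Efficacy describes a relative reduction in risk, not the share of vaccinated people for whom the vaccine fails.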

Role of Statistics in Research

Components of the research process include:
- Planning/Design:
  - Identify primary questions:
    - Is it about quantifying a single group?
    - Is it comparing multiple groups?
  - Determine sample size:
    - Total subjects?
    - Distribution across groups?
  - Selecting participants:
    - Random selection versus convenience sampling.
    - How to assign participants to groups for comparisons.
- Data Collection and Analysis:
  - Summarization of raw data.
  - Address variability that obscures patterns.
  - Inference: Using study information to make statements about the population.
- Presentation and Interpretation:
  - Which measures convey the main messages effectively.
  - Clarifying uncertainty and deriving practical meaning from results.

Course Goals

Overview of Skills:
- Term 1 Goals:
  - Summarization
  - Measurement of Associations
  - Interval Estimation and Statistical Inference
  - Sample Size Considerations in Study Design
- Term 2 Goals:
  - Adjustment Techniques
  - Assessing Effect Modification (Statistical Interactions)
  - Understanding various regression techniques: linear, logistic, and time-to-event.

Universal Goals in Statistics

Focus on:
- Correct interpretation of statistical results.
- Summarizing published study results clearly.
- Evaluating strengths and weaknesses in published research regarding:
  - Study design clarity
  - Research questions
  - Appropriateness of statistical methods
  - Clarity of results reported
  - Overall scientific conclusions.

Defining Populations and Samples

Population: Entire group for which data is sought.
- Example: All 18-year-old male college students in the U.S.
Sample: Subset of the population used for data collection.
- Example: 25 18-year-old male college students in the U.S.
Characteristics of a random sample should ideally reflect the overall population, although this alignment is not always achievable.

Random Sampling

Random sampling is the best way to obtain a representative sample, though it is not always feasible.
A simple random sample is defined as one drawn by a method in which each possible subset of a given size (n) has an equal chance of selection.
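
The "each possible subset has an equal chance" definition can be checked empirically. The sketch below is an illustration under assumed values: a tiny hypothetical population of five labeled people and samples of size n = 2. It enumerates every possible subset and then counts how often Python's `random.sample` produces each one.

```python
import random
from collections import Counter
from itertools import combinations

population = ["A", "B", "C", "D", "E"]  # hypothetical population of 5 people
n = 2                                   # sample size

# Every possible sample of size n (order does not matter): 10 subsets here.
all_subsets = list(combinations(population, n))

rng = random.Random(0)  # fixed seed so the run is reproducible
draws = 100_000
counts = Counter(
    tuple(sorted(rng.sample(population, n))) for _ in range(draws)
)

# Each subset should turn up in roughly 1/10 of the draws.
for subset in all_subsets:
    print(subset, round(counts[subset] / draws, 3))
```

Real studies draw one sample, not 100,000; the repetition here only serves to show that no subset is favored by the sampling method.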

Comparative Analysis: Population Versus Sample

Research focuses on estimating population truths using imperfect sample data.
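
As a rough illustration of estimating a population truth from imperfect sample data, the sketch below simulates a hypothetical population of systolic blood pressure values and draws several random samples of 25. The population size, mean, and spread are assumptions made up for the example.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is reproducible

# Hypothetical population: 100,000 systolic blood pressure values (mmHg).
population = [rng.gauss(120, 15) for _ in range(100_000)]
population_mean = statistics.mean(population)

# Five independent random samples of 25 people each.
sample_means = [
    statistics.mean(rng.sample(population, 25)) for _ in range(5)
]

print(f"Population mean (unknown in practice): {population_mean:.1f}")
for i, m in enumerate(sample_means, start=1):
    print(f"Sample {i} mean: {m:.1f}")
```

Each sample mean lands near the population mean but not exactly on it, and the samples disagree with one another. That sampling variability is what statistical inference has to account for.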

Examples of Sample vs. Population

Pulmonary health research example: 113 men sampled and blood pressure measured.
Maternal HIV transmission study: Observed 183 births to HIV+ women, 22% transmission rate obtained.
Geographic lung cancer study: Used data from a single year for a selected U.S. state.

Non-Random Sample Types

Non-random sampling may introduce bias because some parts of the population may not be represented. Examples of populations that are hard to sample at random:
- Potential voters, including those who are not registered.
- Specific risk groups, such as intravenous drug users.
- Homeless populations.

Implications of Non-Random Sampling

Such sampling may not accurately represent population characteristics, potentially skewing findings and interpretations.
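
A small simulation can show how a convenience sample skews an estimate. Everything in this sketch is hypothetical: a population in which 20% of people attend a clinic, a condition that is more common among attendees (30%) than non-attendees (5%), and a convenience sample recruited only at the clinic.

```python
import random

rng = random.Random(2)  # fixed seed so the run is reproducible

# Build the hypothetical population as (attends_clinic, has_condition) pairs.
population = []
for _ in range(50_000):
    attends_clinic = rng.random() < 0.20
    has_condition = rng.random() < (0.30 if attends_clinic else 0.05)
    population.append((attends_clinic, has_condition))


def prevalence(people):
    """Proportion of people in the list who have the condition."""
    return sum(has_condition for _, has_condition in people) / len(people)


true_prevalence = prevalence(population)

# Random sample: every person in the population is eligible.
random_sample = rng.sample(population, 500)

# Convenience sample: only clinic attendees can be recruited.
clinic_attendees = [person for person in population if person[0]]
convenience_sample = rng.sample(clinic_attendees, 500)

print(f"True prevalence:        {true_prevalence:.3f}")
print(f"Random sample estimate: {prevalence(random_sample):.3f}")
print(f"Convenience estimate:   {prevalence(convenience_sample):.3f}")
```

The random sample lands close to the true prevalence of about 10%, while the clinic-only sample overstates it roughly threefold. Increasing the convenience sample's size would not repair this, because the error comes from who is sampled, not from how many.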

Comparison of Study Designs

Learning objectives:
- Describe and distinguish randomized cohort, observational cohort, and case-control designs.
- Understand the analytical challenges of non-randomized comparisons.

Common Study Design Types

Randomized Cohort Studies:
- Prospective, controlled design where subjects are randomly assigned to exposure groups (for example, treatment or control) and followed to compare outcomes.
Observational Cohort Studies:
- Subjects selected based on exposure, followed to see outcomes.
Case-Control Studies:
- Subjects selected based on outcome status followed by assessments of prior exposure.

Importance of Randomization in Experiments

Randomization tends to make the comparison groups similar in everything except the exposure, which minimizes systematic differences between groups and mitigates bias.
- Landmark study: The Salk Polio Vaccine trial involved over 200,000 subjects; results adjusted for accuracy.
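
The balancing effect of randomization can be sketched with a short simulation. The subjects, their ages, and the group sizes below are hypothetical; age stands in for any characteristic other than the exposure.

```python
import random
import statistics

rng = random.Random(3)  # fixed seed so the run is reproducible

# Hypothetical ages of 2,000 study subjects.
ages = [rng.uniform(20, 70) for _ in range(2_000)]

# Randomly assign subjects to two equal-sized groups by shuffling.
shuffled = ages[:]
rng.shuffle(shuffled)
treatment, control = shuffled[:1_000], shuffled[1_000:]

diff = statistics.mean(treatment) - statistics.mean(control)

print(f"Mean age, treatment: {statistics.mean(treatment):.1f}")
print(f"Mean age, control:   {statistics.mean(control):.1f}")
print(f"Difference:          {diff:.2f}")
```

The two groups end up with nearly the same average age even though age played no part in the assignment. The same holds, on average, for characteristics the researchers never measured, which is the main appeal of randomization.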

Randomization Limitations

Randomization is not always possible or ethical. For example, researchers cannot randomly assign people to smoke when the exposure carries known health risks.

Analyzing Observational Studies

Subjects self-select their exposures, which introduces bias and makes clear conclusions harder to draw.
- Example: Smoking is correlated with alcohol use, so an outcome difference between smokers and non-smokers may partly reflect differences in drinking.
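
The smoking and alcohol example can be illustrated with made-up counts. In this hypothetical sketch the outcome risk depends only on alcohol use, yet smokers drink more often, so a crude comparison makes smoking look harmful. All counts are invented for the illustration.

```python
# Hypothetical counts: (alcohol use, smoking) -> (number with outcome, group size)
counts = {
    ("drinks", "smoker"): (160, 800),
    ("drinks", "non-smoker"): (40, 200),
    ("does not drink", "smoker"): (10, 200),
    ("does not drink", "non-smoker"): (40, 800),
}


def risk(events, total):
    """Proportion of a group that experienced the outcome."""
    return events / total


def crude_risk(smoking_status):
    """Risk for all smokers (or non-smokers), ignoring alcohol use."""
    events = sum(e for (_, s), (e, _n) in counts.items() if s == smoking_status)
    total = sum(n for (_, s), (_e, n) in counts.items() if s == smoking_status)
    return events / total


crude_rr = crude_risk("smoker") / crude_risk("non-smoker")

# Relative risk of smoking within each alcohol stratum.
stratum_rr = {
    alcohol: risk(*counts[(alcohol, "smoker")])
    / risk(*counts[(alcohol, "non-smoker")])
    for alcohol in ("drinks", "does not drink")
}

print(f"Crude relative risk (smokers vs non-smokers): {crude_rr:.3f}")
for alcohol, rr in stratum_rr.items():
    print(f"Relative risk among those who '{alcohol}': {rr:.3f}")
```

The crude relative risk is 2.125, but within each alcohol stratum it is 1.0. Comparing like with like removes the apparent smoking effect, which is the basic idea behind adjustment.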

Example of Observational Cohort Study: Needle Exchange Programs

Estimates the relative risk of HIV infection associated with program participation, adjusting for demographic differences between participants and non-participants.

Example of HPV Vaccination Study

Outcomes of vaccination were studied by gender, with findings adjusted for health-seeking behaviors and demographics.

Case-Control Studies

An efficient alternative to cohort studies for analyzing rare outcomes, such as the association between an exposure and lung cancer.
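
The notes do not name a summary measure for case-control data. The measure conventionally used is the odds ratio, and the sketch below computes it under that assumption. The counts are hypothetical: 100 cases and 100 controls classified by prior exposure.

```python
def odds_ratio(cases_exposed, cases_unexposed, controls_exposed, controls_unexposed):
    """Odds of exposure among cases divided by odds of exposure among controls."""
    odds_cases = cases_exposed / cases_unexposed
    odds_controls = controls_exposed / controls_unexposed
    return odds_cases / odds_controls


# Hypothetical 2x2 table (illustrative numbers only).
or_estimate = odds_ratio(cases_exposed=60, cases_unexposed=40,
                         controls_exposed=30, controls_unexposed=70)

print(f"Odds ratio: {or_estimate:.2f}")
```

With these counts the odds ratio is 3.5. Because the researcher fixes how many cases and controls are sampled, the proportion of subjects with the outcome is set by design, so risks cannot be estimated directly from such a table; the odds ratio can.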

Challenges with Case-Control Studies

Confounding factors and recall bias impact reliability, emphasizing the importance of controlling analyses for potential distortions.

Summary of Study Type Differences

Non-randomized designs of every type raise issues, such as confounding, that must be addressed in the analysis, with implications for public health research.

Types of Data in Research

Learning objectives focus on categorizing data types effectively.
- Continuous Data: Measurements that can take any value within a range (e.g., blood pressure).
- Binary Data: Takes only two values (e.g., yes/no).
- Categorical Data: Extends binary data to more than two categories, further split into nominal (unordered) and ordinal (ordered) categories.
- Time-to-Event Data: Captures both whether an event occurred and the time until it occurred (e.g., time to relapse).

Data Analysis Considerations

Use appropriate tools for different data types for robust statistical assessments.
Various comparison techniques are employed based on data format:
- Continuous: Utilize mean differences and statistical tests like t-tests.
- Binary/Categorical: Comparison using proportion differences and chi-squared tests.
- Time-to-Event: Incidence rate ratios and Kaplan-Meier curves for survival analysis.
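
The comparison measures listed above can be sketched with a few lines of arithmetic. All data below are hypothetical, and the sketch covers only the summary measures (mean difference, difference in proportions, incidence rate ratio), not the t-test, chi-squared test, or Kaplan-Meier curve.

```python
import statistics

# Continuous: hypothetical systolic blood pressure reductions (mmHg).
web_counseling = [12, 9, 11, 8, 10]
control = [7, 5, 6, 4, 8]
mean_difference = statistics.mean(web_counseling) - statistics.mean(control)

# Binary: hypothetical numbers testing positive in two groups of 200.
positives_a, n_a = 30, 200
positives_b, n_b = 18, 200
proportion_difference = positives_a / n_a - positives_b / n_b

# Time-to-event: hypothetical event counts and person-years of follow-up.
events_a, person_years_a = 12, 400
events_b, person_years_b = 6, 500
incidence_rate_ratio = (events_a / person_years_a) / (events_b / person_years_b)

print(f"Mean difference:           {mean_difference:.2f} mmHg")
print(f"Difference in proportions: {proportion_difference:.2f}")
print(f"Incidence rate ratio:      {incidence_rate_ratio:.2f}")
```

These values work out to 4 mmHg, 0.06, and 2.5. Each measure matches its data type: averages for continuous measurements, proportions for yes/no outcomes, and rates per unit of follow-up time when subjects are observed for different lengths of time.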

Summary of Analyzed Data Types

Three key types in exploration:
- Continuous, Binary/Categorical, Time-to-Event
- Different methodologies critical for summarization and analysis.