Foundations of Biostatistics and Epidemiology — Study Notes (Bullet Points)
Prevalence
- Definition: Prevalence measures the proportion of individuals in a population who have a disease (or condition) at a specific point in time or over a specified period.
- Types:
- Point Prevalence: At a specific moment in time
- Period Prevalence: Over a defined time period (e.g., past year)
- Formula:
\text{Prevalence} = \frac{\text{Number of existing cases of disease}}{\text{Total population at the same time}} \times 100
- Explanation: Can be expressed as a percentage, indicating the burden of disease in a population.
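As a quick check of the prevalence formula, the following Python sketch computes point prevalence as a percentage. The function name and the counts (150 existing cases in a population of 5,000) are hypothetical and chosen only for illustration.

```python
def prevalence_percent(existing_cases: int, population: int) -> float:
    """Return prevalence as a percentage of the total population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return existing_cases / population * 100


# Hypothetical counts: 150 existing cases among 5,000 people.
point_prevalence = prevalence_percent(150, 5000)
print(f"Point prevalence: {point_prevalence:.1f}%")
```

The same function serves for period prevalence if the numerator counts everyone who had the disease at any time during the period.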
Incidence
- Definition: Incidence refers to the number of new cases of a disease that develop in a population at risk during a specified time period.
- Two main types:
- A. Cumulative Incidence (CI) (Also called Incidence Proportion)
- Definition: Measures the proportion of initially disease-free people who develop the disease over a period of time.
- Formula:
\text{CI} = \frac{\text{Number of new cases during a time period}}{\text{Population at risk at the beginning of the period}}
- Explanation: Assumes the population is stable (no major loss to follow-up) and represents the average risk of developing the disease over the specified period.
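A minimal Python sketch of the cumulative incidence calculation follows. The function name and the cohort numbers (40 new cases among 2,000 initially disease-free people) are hypothetical.

```python
def cumulative_incidence(new_cases: int, at_risk_at_start: int) -> float:
    """Return the proportion of the initially at-risk group that became cases."""
    if at_risk_at_start <= 0:
        raise ValueError("at_risk_at_start must be positive")
    if new_cases > at_risk_at_start:
        raise ValueError("new_cases cannot exceed the population at risk")
    return new_cases / at_risk_at_start


# Hypothetical cohort: 40 new cases among 2,000 disease-free people over one year.
ci_one_year = cumulative_incidence(40, 2000)
print(f"One-year cumulative incidence: {ci_one_year:.3f}")
```

Because the denominator is fixed at the start of follow-up, the result is only meaningful together with the length of the period (here, one year).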
Connection Between Incidence and Prevalence
- Explanation: Prevalence (the proportion with the disease at a given time) is influenced by two key factors:
1) Incidence - how often new cases occur
2) Duration - how long people remain affected by the disease
- If a disease has a high incidence but people recover quickly, prevalence may be low.
- If a disease has a low incidence but people live with it for a long time, prevalence can be high because existing cases accumulate.
- Conceptual flow:
Incidence (\rightarrow) New cases; Duration (\rightarrow) length of time cases remain; Recovery/Death remove cases from prevalence.
Examples: Incidence vs Prevalence
- Flu (Influenza):
- Incidence: High (\rightarrow) many new cases each flu season
- Duration: Short (\rightarrow) most recover in a week or two
- Result: Prevalence remains low at any given time because recovery is rapid.
- HIV/AIDS:
- Incidence: Relatively low (\rightarrow) fewer new infections per year
- Duration: Long (\rightarrow) people live with HIV for many years due to treatment
- Result: Prevalence is high because people stay HIV-positive for long periods.
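The contrast between the two examples can be made concrete with the common steady-state approximation, prevalence ≈ incidence rate × average duration. This approximation is not stated in these notes; it is a standard rule of thumb that holds reasonably well only when prevalence is low and both quantities are stable over time. The rates and durations below are hypothetical, not real estimates for influenza or HIV.

```python
def approx_prevalence(incidence_rate_per_year: float, mean_duration_years: float) -> float:
    """Steady-state approximation: prevalence ~ incidence rate x mean duration.

    Assumes low prevalence and a stable population (an assumption, not a law).
    """
    return incidence_rate_per_year * mean_duration_years


# Hypothetical acute disease: many new cases, about 10 days of illness each.
acute = approx_prevalence(incidence_rate_per_year=0.10, mean_duration_years=10 / 365)

# Hypothetical chronic disease: few new cases, about 20 years lived with the disease.
chronic = approx_prevalence(incidence_rate_per_year=0.005, mean_duration_years=20)

print(f"Acute disease prevalence:   {acute:.4f}")
print(f"Chronic disease prevalence: {chronic:.4f}")
```

Even though the hypothetical acute disease has a twenty-fold higher incidence rate, its approximate prevalence is far lower because each case lasts only a short time.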
Risk Difference
- Term: Risk Difference (RD)
- Definition: RD is an absolute measure of association between exposure and outcome, indicating the excess risk of disease attributable to the exposure.
- General formula:
\text{RD} = \text{Measure}_{\text{exposed}} - \text{Measure}_{\text{unexposed}}
- Specific forms using different measures:
- Using Prevalence Proportion (PP):
\text{RD}_{\text{PP}} = \text{PP}_{\text{exposed}} - \text{PP}_{\text{unexposed}}
- Using Cumulative Incidence (CI):
\text{RD}_{\text{CI}} = \text{CI}_{\text{exposed}} - \text{CI}_{\text{unexposed}}
- Using Incidence Rate (IR):
\text{RD}_{\text{IR}} = \text{IR}_{\text{exposed}} - \text{IR}_{\text{unexposed}}
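Since every form of the risk difference is the same subtraction applied to a different measure, one small function covers all three. The sketch below uses hypothetical cumulative incidences (0.30 in the exposed, 0.10 in the unexposed); the function name is likewise illustrative.

```python
def risk_difference(measure_exposed: float, measure_unexposed: float) -> float:
    """Return the absolute difference between exposed and unexposed groups.

    Both arguments must be the same kind of measure (PP, CI, or IR)
    expressed in the same units.
    """
    return measure_exposed - measure_unexposed


# Hypothetical cumulative incidences over the same follow-up period.
rd_ci = risk_difference(measure_exposed=0.30, measure_unexposed=0.10)
print(f"RD (cumulative incidence): {rd_ci:.2f}")
```

A positive value indicates excess risk among the exposed, a negative value indicates lower risk among the exposed, and zero indicates no difference.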
Attributable Risk
- Term: Population Attributable Risk (PAR)
- Definition: The PAR is a difference measure that quantifies how much of the disease burden in the entire population can be attributed to a specific risk factor.
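- Explanation: It reflects the excess incidence in the whole population that could be prevented if the risk factor were eliminated, assuming the exposure is causal. Expressed as a share of the total population incidence, it becomes the PAR%.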
- Formula (conceptual):
\text{PAR} = I_{\text{pop}} - I_{\text{unexposed}}
where (I) denotes the incidence (or incidence rate) in the population and in the unexposed group.
- PAR% (proportion of disease in the population that could be prevented):
\text{PAR\%} = \frac{I_{\text{pop}} - I_{\text{unexposed}}}{I_{\text{pop}}} \times 100
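The two PAR formulas can be sketched in Python as follows. The function names and the incidences (0.15 in the whole population, 0.10 in the unexposed) are hypothetical.

```python
def population_attributable_risk(i_pop: float, i_unexposed: float) -> float:
    """PAR: incidence in the whole population minus incidence in the unexposed."""
    return i_pop - i_unexposed


def par_percent(i_pop: float, i_unexposed: float) -> float:
    """PAR%: PAR as a percentage of the incidence in the whole population."""
    if i_pop <= 0:
        raise ValueError("i_pop must be positive")
    return (i_pop - i_unexposed) / i_pop * 100


# Hypothetical incidences measured over the same period.
par = population_attributable_risk(i_pop=0.15, i_unexposed=0.10)
par_pct = par_percent(i_pop=0.15, i_unexposed=0.10)
print(f"PAR:  {par:.2f}")
print(f"PAR%: {par_pct:.1f}%")
```

With these numbers, about one third of the disease in the population would be attributed to the exposure.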
Relative Risk, Odds & Rate Ratios
- Relative Risk (RR):
- Definition: Compares the probability (risk) of disease in the exposed group to that in the unexposed group.
- Formula:
\text{RR} = \frac{P(D|E)}{P(D|\neg E)} = \frac{\text{Risk in exposed}}{\text{Risk in unexposed}}
- Odds Ratio (OR):
- Definition: Ratio of the odds of disease in the exposed group to the odds in the unexposed group.
- Odds in exposed: ( \frac{P(D|E)}{1 - P(D|E)} )
- Odds in unexposed: ( \frac{P(D|\neg E)}{1 - P(D|\neg E)} )
- Formula:
\text{OR} = \frac{ \frac{P(D|E)}{1 - P(D|E)} }{ \frac{P(D|\neg E)}{1 - P(D|\neg E)} }
- Rate Ratio (Incidence Rate Ratio, IRR):
- Definition: Ratio of incidence rates (person-time rates) in the exposed and unexposed groups.
- Formula:
\text{IRR} = \frac{\text{IR}_{\text{exposed}}}{\text{IR}_{\text{unexposed}}}
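The three ratio measures can be computed from simple counts. The sketch below assumes the conventional 2×2 table layout, which is not spelled out in these notes: a = exposed with disease, b = exposed without disease, c = unexposed with disease, d = unexposed without disease. All counts and person-time values are hypothetical.

```python
def relative_risk(a: int, b: int, c: int, d: int) -> float:
    """RR = risk in exposed / risk in unexposed."""
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """OR = odds of disease in exposed / odds of disease in unexposed."""
    return (a / b) / (c / d)


def incidence_rate_ratio(cases_exposed: int, person_time_exposed: float,
                         cases_unexposed: int, person_time_unexposed: float) -> float:
    """IRR = incidence rate in exposed / incidence rate in unexposed."""
    rate_exposed = cases_exposed / person_time_exposed
    rate_unexposed = cases_unexposed / person_time_unexposed
    return rate_exposed / rate_unexposed


# Hypothetical 2x2 table: 30 of 100 exposed and 10 of 100 unexposed develop disease.
rr = relative_risk(a=30, b=70, c=10, d=90)
oratio = odds_ratio(a=30, b=70, c=10, d=90)

# Hypothetical person-time data: 1,000 person-years of follow-up in each group.
irr = incidence_rate_ratio(30, 1000.0, 10, 1000.0)

print(f"RR = {rr:.2f}, OR = {oratio:.2f}, IRR = {irr:.2f}")
```

In this example the OR is noticeably larger than the RR because the disease is common; the two approach each other only when the outcome is rare in both groups.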
Descriptive Statistics
- Definition: Descriptive statistics summarize, organize, and interpret data.
- Key components:
- Measures of Central Tendency: Mean, Median, Mode, Geometric Mean
- Measures of Variability (Spread): Range, Variance, Standard Deviation
- Data Visualization: Bar Graphs, Histograms, Box Plots
Measures of Location
- Arithmetic Mean (Mean):
- Definition: The sum of all values divided by the number of values. It is the most common measure of central tendency but is sensitive to outliers.
- Formula:
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
- Median:
- Definition: The middle value when data are ordered; robust to extreme values.
- Procedure:
- Arrange data in ascending order
- If (n) is odd, the median is the middle value
- If (n) is even, the median is the average of the two middle values
- Example: With five scores (80, 85, 90, 95, 100), median = 90.
- Mode:
- Definition: Most frequently occurring value; particularly useful for categorical data, especially nominal data, where means and medians are not meaningful.
- Explanation: If no value repeats (as in 80, 85, 90, 95, 100), there is no mode.
- Geometric Mean:
- Definition: Used for percentages and ratios, especially for data that compounds over time (e.g., growth rates).
- Explanation: For growth rates, convert to factors: ( a_i = 1 + r_i ).
- Formula:
\text{GM} = \left( \prod_{i=1}^{n} a_i \right)^{1/n}
- Explanation: If expressing as a rate, subtract 1: ( \text{GM}_{\text{rate}} = \text{GM} - 1 ).
- Example: Growth rates 10%, 15%, 20% (\rightarrow) factors 1.10, 1.15, 1.20
\text{GM} = \left( 1.10 \cdot 1.15 \cdot 1.20 \right)^{1/3} \approx 1.149 \Rightarrow \text{GM}_{\text{rate}} \approx 14.9\%
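The measures of location above can be checked with Python's standard `statistics` module, using the five scores and the three growth rates from the examples. The variable names are illustrative, and treating "every value occurs once" as "no mode" follows the convention in these notes rather than the library's default behaviour.

```python
import statistics
from collections import Counter

scores = [80, 85, 90, 95, 100]

mean_score = statistics.mean(scores)
median_score = statistics.median(scores)

# Convention used in these notes: if no value repeats, there is no mode.
counts = Counter(scores)
has_mode = max(counts.values()) > 1
modes = statistics.multimode(scores) if has_mode else []

# Geometric mean of growth rates: convert each rate r to a factor 1 + r first.
growth_rates = [0.10, 0.15, 0.20]
factors = [1 + r for r in growth_rates]
gm_factor = statistics.geometric_mean(factors)
gm_rate = gm_factor - 1

print(f"Mean = {mean_score}, median = {median_score}, modes = {modes}")
print(f"Geometric mean factor = {gm_factor:.3f}, rate = {gm_rate:.1%}")
```

The geometric mean rate comes out at about 14.9%, slightly below the arithmetic mean of 15%, matching the worked example.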
Why Use the Geometric Mean?
- Explanation: The geometric mean is appropriate for percentages or ratios because it accounts for compounding. It is also less influenced by unusually large values than the arithmetic mean. For example, the arithmetic mean of the growth rates 10%, 15%, 20% is 15%, which slightly overstates the average compound growth; the geometric mean gives the single rate that, applied in every period, reproduces the same total growth ((\approx) 14.9%).
Measures of Spread
- Range:
- Definition: Difference between the largest and smallest values in a dataset.
- Formula: \text{Range} = \max(x_i) - \min(x_i)
- Explanation: Sensitive to outliers and to sample size: it depends only on the two most extreme values and ignores all other data points.
- Quantiles:
- Definition: Divide data into equal parts, providing insight into the distribution of values.
- Types:
- Quartiles: four equal parts (each 25%)
- Quintiles: five equal parts (each 20%)
- Deciles: ten equal parts (each 10%)
- Percentiles: one hundred equal parts (each 1%)
Range (Example)
- Dataset: 65, 70, 75, 80, 85, 90, 95
- Calculation: Largest = 95, Smallest = 65
- Result: Range = 95 - 65 = 30
Quantiles (Example)
- Explanation: A set of 20 ordered scores: 50, 52, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77, 79, 81, 83, 85, 87, 89. Quantiles partition the data into 4, 5, 10, or 100 equal parts as needed to understand the data's distribution.
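As a sketch of how the cut points for these 20 scores could be computed, the example below uses `statistics.quantiles` from the Python standard library. The choice of `method="inclusive"` is an assumption: statistical packages use several different interpolation rules, so quartiles reported by other software may differ slightly.

```python
import statistics

scores = [50, 52, 55, 57, 59, 61, 63, 65, 67, 69,
          71, 73, 75, 77, 79, 81, 83, 85, 87, 89]

# n=4 returns the 3 cut points that split the data into quartiles,
# n=5 returns 4 cut points (quintiles), n=10 returns 9 cut points (deciles).
quartiles = statistics.quantiles(scores, n=4, method="inclusive")
quintiles = statistics.quantiles(scores, n=5, method="inclusive")
deciles = statistics.quantiles(scores, n=10, method="inclusive")

print("Quartile cut points:", quartiles)
print("Quintile cut points:", quintiles)
print("Decile cut points:  ", deciles)
```

Under this rule the quartile cut points are 60.5, 70.0, and 79.5, where the middle value is the median of the 20 scores.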
Variance and Standard Deviation
- Variance (s^2):
- Definition: Measures how far data points are spread from the mean, indicating the average of the squared differences from the mean.
- Formula:
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
- Units: Square of the original data's units, which can make interpretation difficult.
- Standard Deviation (s):
- Definition: Square root of variance; measures spread in the same units as the data, making it more interpretable than variance.
- Formula:
s = \sqrt{s^2}
Calculating Variance & Standard Deviation
- Steps to compute Sample Standard Deviation (s):
1) Find the Sample Mean: (\bar{x})
2) Compute Each Deviation: (x_i - \bar{x})
3) Square Each Deviation
4) Sum the Squared Deviations
5) Divide by (n - 1) (for sample variance)
6) Take the Square Root (to get standard deviation)
6) Take the Square Root (to get standard deviation)
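The six steps can be followed line by line in Python. The sketch below applies them to the dataset from the range example (65, 70, 75, 80, 85, 90, 95) and compares the result with the standard library's `statistics.variance` and `statistics.stdev`; the variable names are illustrative.

```python
import math
import statistics

data = [65, 70, 75, 80, 85, 90, 95]
n = len(data)

mean = sum(data) / n                          # 1) sample mean
deviations = [x - mean for x in data]         # 2) deviation of each value
squared = [d ** 2 for d in deviations]        # 3) squared deviations
sum_of_squares = sum(squared)                 # 4) sum of squared deviations
sample_variance = sum_of_squares / (n - 1)    # 5) divide by n - 1
sample_sd = math.sqrt(sample_variance)        # 6) square root

print(f"Mean = {mean}, sum of squares = {sum_of_squares}")
print(f"Sample variance = {sample_variance:.2f}, sample SD = {sample_sd:.2f}")

# Cross-check against the standard library (both use the n - 1 denominator).
print(math.isclose(sample_variance, statistics.variance(data)))
print(math.isclose(sample_sd, statistics.stdev(data)))
```

For this dataset the mean is 80 and the squared deviations sum to 700, giving a sample variance of 700 / 6 ≈ 116.67 and a standard deviation of about 10.80.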
Visualizing Data
- Box-and-Whisker Plot (Box Plot):
- Description: Shows distribution via a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum.
- Purpose: Useful for identifying the shape of the distribution, skewness, and potential outliers (e.g., Box Plot of Cholesterol Levels).
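The five-number summary behind a box plot can be computed directly. The sketch below reuses the 20 ordered scores from the quantiles example rather than cholesterol data, which these notes do not provide. Two choices are assumptions: the `inclusive` quartile rule, and the 1.5 × IQR fences, a common plotting convention for flagging potential outliers that is not defined in these notes.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


scores = [50, 52, 55, 57, 59, 61, 63, 65, 67, 69,
          71, 73, 75, 77, 79, 81, 83, 85, 87, 89]

summary = five_number_summary(scores)
minimum, q1, median, q3, maximum = summary

# Common convention (an assumption here): points beyond 1.5 x IQR from the
# box are drawn individually as potential outliers.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in scores if x < lower_fence or x > upper_fence]

print("Five-number summary:", summary)
print("Potential outliers:", outliers)
```

For these evenly spaced scores no value falls outside the fences, so a box plot would show whiskers reaching the minimum and maximum with no individually plotted points.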
Descriptive Stats: Examples from the Transcript
- Ex 2.1 (Cancer, Nutrition):
- Objective: Describe vitamin A consumption among cases (cancer patients) and controls before formal analysis.
- Setup: 200 hospitalized cancer patients (cases) and 200 matched controls.
- Matching: Controls matched on age and sex; controls hospitalized for an unrelated disease.
- Ex 2.2 (Pulmonary Disease):
- Question: Do passive smokers have impaired pulmonary function compared with nonsmokers in smoky environments?
- Context: San Diego study, 1980.
- Summary: Passive smokers had lower pulmonary function compared with comparable nonsmokers in smoky environments.
- Ex 2.3 (Graphs):
- Explanation: A scatter plot for CO concentrations in two working environments shows divergence during the day and convergence after the workday, illustrating relationships between variables over time.
Practical Data Visualization / Examples
- Explanation: Descriptive statistics in case-control settings require describing the data first (Ex 2.1). Visuals such as bar graphs, histograms, scatter plots, and box plots aid interpretation of distributions and relationships (Ex 2.3).
Box Plot Example
- Summary: Box plots summarize Cholesterol Levels with a five-number summary (minimum, Q1, median, Q3, maximum) and visually identify potential outliers.