Foundations of Biostatistics and Epidemiology — Study Notes (Bullet Points)
Prevalence
Definition: Prevalence measures the proportion of individuals in a population who have a disease (or condition) at a specific point in time or over a specified period.
Types:
Point Prevalence: At a specific moment in time
Period Prevalence: Over a defined time period (e.g., past year)
Formula: \( \text{Prevalence} = \frac{\text{Number of existing cases of disease}}{\text{Total population at the same time}} \times 100 \)
Explanation: Can be expressed as a percentage, indicating the burden of disease in a population.
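Worked example: A minimal Python sketch of the prevalence formula above. The function name `prevalence_percent` and the counts are hypothetical, chosen only for illustration.

```python
def prevalence_percent(existing_cases: int, population: int) -> float:
    """Prevalence as a percentage: existing cases / total population x 100."""
    if population <= 0:
        raise ValueError("population must be positive")
    return 100 * existing_cases / population


# Hypothetical counts: 150 existing cases among 5,000 people at one point in time.
print(prevalence_percent(150, 5000))  # 3.0
```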
Incidence
Definition: Incidence refers to the number of new cases of a disease that develop in a population at risk during a specified time period.
Two main types:
A. Cumulative Incidence (CI) (Also called Incidence Proportion)
Definition: Measures the proportion of initially disease-free people who develop the disease over a period of time.
Formula: \( CI = \frac{\text{Number of new cases during a time period}}{\text{Population at risk at the beginning of the period}} \)
Explanation: Assumes the population is stable (no major loss to follow-up) and represents the average risk of developing the disease over the specified period.
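Worked example: A short Python sketch of cumulative incidence, assuming a closed cohort with no loss to follow-up. The function name and the counts are hypothetical.

```python
def cumulative_incidence(new_cases: int, at_risk_at_start: int) -> float:
    """Proportion of initially disease-free people who become cases."""
    if at_risk_at_start <= 0:
        raise ValueError("population at risk must be positive")
    return new_cases / at_risk_at_start


# Hypothetical cohort: 30 new cases among 1,200 disease-free people over one year.
ci = cumulative_incidence(30, 1200)
print(f"{ci:.3f}")  # 0.025, i.e. a 2.5% one-year risk
```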
Connection Between Incidence and Prevalence
Explanation: Prevalence (the proportion with the disease at a given time) is influenced by two key factors:
1) Incidence - how often new cases occur
2) Duration - how long people remain affected by the disease
If a disease has a high incidence but people recover quickly, prevalence may be low.
If a disease has a low incidence but people live with it for a long time, prevalence can be high because existing cases accumulate.
Conceptual flow:
Incidence \(\rightarrow\) New cases; Duration \(\rightarrow\) length of time cases remain; Recovery/Death remove cases from prevalence.
Examples: Incidence vs Prevalence
Flu (Influenza):
Incidence: High \(\rightarrow\) many new cases each flu season
Duration: Short \(\rightarrow\) most recover in a week or two
Result: Prevalence remains low at any given time because recovery is rapid.
HIV/AIDS:
Incidence: Relatively low \(\rightarrow\) fewer new infections per year
Duration: Long \(\rightarrow\) people live with HIV for many years due to treatment
Result: Prevalence is high because people stay HIV-positive for long periods.
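Worked example: The two patterns above can be illustrated with the standard steady-state relationship, prevalence odds = incidence rate × average duration (for rare diseases, prevalence ≈ incidence × duration). This relationship is not stated in these notes and assumes a stable population. The function name and all input values below are hypothetical, not real flu or HIV figures.

```python
def steady_state_prevalence(incidence_rate: float, mean_duration: float) -> float:
    """Prevalence proportion implied by: prevalence odds = incidence rate x duration.

    incidence_rate is per person-year; mean_duration is in years.
    """
    odds = incidence_rate * mean_duration
    return odds / (1 + odds)


# Hypothetical short illness: high incidence, about two weeks' duration.
short_illness = steady_state_prevalence(0.10, 2 / 52)
# Hypothetical chronic condition: low incidence, about 30 years' duration.
long_illness = steady_state_prevalence(0.005, 30)

print(f"short illness prevalence: {short_illness:.4f}")
print(f"chronic condition prevalence: {long_illness:.4f}")
```

Although the short illness has a 20-fold higher incidence in this sketch, its prevalence comes out far lower because cases leave the prevalent pool quickly.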
Risk Difference
Term: Risk Difference (RD)
Definition: RD is an absolute measure of association between exposure and outcome, indicating the excess risk of disease attributable to the exposure.
General formula: \( RD = \text{Measure}_{\text{exposed}} - \text{Measure}_{\text{unexposed}} \)
Specific forms using different measures:
Using Prevalence Proportion (PP): \( RD_{PP} = PP_{\text{exposed}} - PP_{\text{unexposed}} \)
Using Cumulative Incidence (CI): \( RD_{CI} = CI_{\text{exposed}} - CI_{\text{unexposed}} \)
Using Incidence Rate (IR): \( RD_{IR} = IR_{\text{exposed}} - IR_{\text{unexposed}} \)
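Worked example: A minimal Python sketch of the risk difference using cumulative incidence. The function name and the cohort counts are hypothetical.

```python
def risk_difference(cases_exposed: int, n_exposed: int,
                    cases_unexposed: int, n_unexposed: int) -> float:
    """RD based on cumulative incidence: CI_exposed - CI_unexposed."""
    ci_exposed = cases_exposed / n_exposed
    ci_unexposed = cases_unexposed / n_unexposed
    return ci_exposed - ci_unexposed


# Hypothetical cohort: 40/200 exposed and 20/200 unexposed develop the disease.
rd = risk_difference(40, 200, 20, 200)
print(round(rd, 3))  # 0.1, i.e. 10 excess cases per 100 exposed people
```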
Attributable Risk
Term: Population Attributable Risk (PAR)
Definition: The PAR is a difference measure that quantifies how much of the disease burden in the entire population can be attributed to a specific risk factor.
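Explanation: It reflects the excess incidence in the total population that could be prevented if the risk factor were eliminated, assuming the association is causal. Expressed as a proportion of the population incidence, it becomes the PAR%.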
Formula (conceptual): \( PAR = I_{\text{pop}} - I_{\text{unexposed}} \)
where \(I\) denotes the incidence (or incidence rate) in the population and in the unexposed group.
PAR% (proportion of disease in the population that could be prevented): \( \text{PAR\%} = \frac{I_{\text{pop}} - I_{\text{unexposed}}}{I_{\text{pop}}} \times 100 \)
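Worked example: A short Python sketch of PAR and PAR% as defined above. The function names and incidence values are hypothetical.

```python
def population_attributable_risk(i_pop: float, i_unexposed: float) -> float:
    """PAR: incidence in the whole population minus incidence in the unexposed."""
    return i_pop - i_unexposed


def par_percent(i_pop: float, i_unexposed: float) -> float:
    """PAR%: PAR as a percentage of the incidence in the whole population."""
    return 100 * (i_pop - i_unexposed) / i_pop


# Hypothetical incidences: 15 per 100 overall, 10 per 100 among the unexposed.
i_pop, i_unexposed = 0.15, 0.10
print(round(population_attributable_risk(i_pop, i_unexposed), 3))  # 0.05
print(round(par_percent(i_pop, i_unexposed), 1))                   # 33.3
```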
Relative Risk, Odds & Rate Ratios
Relative Risk (RR):
Definition: Compares the probability (risk) of disease in the exposed group to that in the unexposed group.
Formula: \( RR = \frac{P(D \mid E)}{P(D \mid \neg E)} = \frac{\text{Risk in exposed}}{\text{Risk in unexposed}} \)
Odds Ratio (OR):
Definition: Ratio of the odds of disease in the exposed group to the odds in the unexposed group.
Odds in exposed: \( \frac{P(D \mid E)}{1 - P(D \mid E)} \)
Odds in unexposed: \( \frac{P(D \mid \neg E)}{1 - P(D \mid \neg E)} \)
Formula: \( OR = \frac{P(D \mid E) / [1 - P(D \mid E)]}{P(D \mid \neg E) / [1 - P(D \mid \neg E)]} \)
Rate Ratio (Incidence Rate Ratio, IRR):
Definition: Ratio of incidence rates (person-time rates) in the exposed and unexposed groups.
Formula: \( IRR = \frac{IR_{\text{exposed}}}{IR_{\text{unexposed}}} \)
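Worked example: The three ratio measures can be sketched in Python from a 2×2 table. The cell labels a, b, c, d (exposed cases, exposed non-cases, unexposed cases, unexposed non-cases) are a common convention rather than notation from these notes, and all counts are hypothetical.

```python
def relative_risk(a: int, b: int, c: int, d: int) -> float:
    """RR = risk in exposed / risk in unexposed."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """OR = (a/b) / (c/d), written in the equivalent cross-product form."""
    return (a * d) / (b * c)


def rate_ratio(cases_exp: int, person_time_exp: float,
               cases_unexp: int, person_time_unexp: float) -> float:
    """IRR = incidence rate in exposed / incidence rate in unexposed."""
    return (cases_exp / person_time_exp) / (cases_unexp / person_time_unexp)


# Hypothetical cohort: 40 of 200 exposed and 20 of 200 unexposed become cases.
a, b, c, d = 40, 160, 20, 180
print(round(relative_risk(a, b, c, d), 2))  # 2.0
print(round(odds_ratio(a, b, c, d), 2))     # 2.25

# Hypothetical person-time data: 30 cases vs 10 cases per 1,000 person-years.
print(round(rate_ratio(30, 1000, 10, 1000), 2))  # 3.0
```

In this sketch the OR (2.25) is further from 1 than the RR (2.0), which is typical when the outcome is not rare.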
Descriptive Statistics
Definition: Descriptive statistics summarize, organize, and interpret data.
Key components:
Measures of Central Tendency: Mean, Median, Mode, Geometric Mean
Measures of Variability (Spread): Range, Variance, Standard Deviation
Data Visualization: Bar Graphs, Histograms, Box Plots
Measures of Location
Arithmetic Mean (Mean):
Definition: The sum of all values divided by the number of values. It is the most common measure of central tendency but is sensitive to outliers.
Formula: \( \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \)
Median:
Definition: The middle value when data are ordered; robust to extreme values.
Procedure:
Arrange data in ascending order
If \(n\) is odd, the median is the middle value
If \(n\) is even, the median is the average of the two middle values
Example: With five scores (80, 85, 90, 95, 100), median = 90.
Mode:
Definition: Most frequently occurring value; particularly useful for categorical data, especially nominal data, where a mean or median is not meaningful.
Explanation: If no value repeats (as in 80, 85, 90, 95, 100), there is no mode.
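Worked example: A brief Python sketch of the mean, median, and mode for the five scores used above. The helper `modes` is a hypothetical function written to follow the convention in these notes (no mode when no value repeats). The standard library's `statistics.multimode` behaves differently and would return every value in that case.

```python
from collections import Counter
from statistics import mean, median


def modes(values):
    """Most frequent value(s); an empty list if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return [value for value, count in counts.items() if count == top]


scores = [80, 85, 90, 95, 100]
print(mean(scores))    # 90
print(median(scores))  # 90
print(modes(scores))   # []

# Hypothetical list with a repeated value, to show a non-empty result.
print(modes([80, 85, 85, 90]))  # [85]
```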
Geometric Mean:
Definition: Used for percentages and ratios, especially for data that compounds over time (e.g., growth rates).
Explanation: For growth rates, convert to factors: \( a_i = 1 + r_i \).
Formula: \( \text{GM} = \left( \prod_{i=1}^{n} a_i \right)^{1/n} \)
Explanation: If expressing as a rate, subtract 1: \( \text{GM}_{\text{rate}} = \text{GM} - 1 \).
Explanation: The geometric mean is appropriate for percentages or ratios because it accounts for compounding, and it is less affected by extreme high values than the arithmetic mean. For example, the arithmetic mean of the growth rates 10%, 15%, and 20% is 15%, which slightly overstates the average compounded growth. The geometric mean of the factors 1.10, 1.15, and 1.20 is about 1.1493, giving a rate of \(\approx\) 14.9%.
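Worked example: The growth-rate calculation can be checked with a short Python sketch. It uses `statistics.geometric_mean` from the standard library (Python 3.8 or later) and the three rates from the example above.

```python
from statistics import geometric_mean

rates = [0.10, 0.15, 0.20]

# Convert each growth rate to a growth factor a_i = 1 + r_i.
factors = [1 + r for r in rates]

# Geometric mean of the factors, then subtract 1 to express it as a rate.
gm_rate = geometric_mean(factors) - 1
arithmetic_rate = sum(rates) / len(rates)

print(f"{gm_rate:.2%}")          # 14.93%
print(f"{arithmetic_rate:.2%}")  # 15.00%
```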
Measures of Spread
Range:
Definition: Difference between the largest and smallest values in a dataset.
Formula: \( \text{Range} = \max(x_i) - \min(x_i) \)
Explanation: Sensitive to outliers and sample size, meaning it can be heavily influenced by extreme values and doesn't use all data points.
Quantiles:
Definition: Divide data into equal parts, providing insight into the distribution of values.
Types:
Quartiles: four equal parts (each 25%)
Quintiles: five equal parts (each 20%)
Deciles: ten equal parts (each 10%)
Percentiles: one hundred equal parts (each 1%)
Range (Example)
Dataset: 65, 70, 75, 80, 85, 90, 95
Calculation: Largest = 95, Smallest = 65
Result: Range = 95 - 65 = 30
Quantiles (Example)
Explanation: A set of 20 ordered scores: 50, 52, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77, 79, 81, 83, 85, 87, 89. Quantiles partition the data into 4, 5, 10, or 100 equal parts as needed to understand the data's distribution.
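Worked example: A short Python sketch that computes the range and quartiles for the two example datasets. Quantile conventions differ between textbooks and software. This sketch uses the default ("exclusive") method of `statistics.quantiles`, which is an assumption and may not match the method used in the course. The helper name `value_range` is hypothetical.

```python
from statistics import quantiles


def value_range(values):
    """Range = largest value - smallest value."""
    return max(values) - min(values)


range_data = [65, 70, 75, 80, 85, 90, 95]
print(value_range(range_data))  # 30

scores = [50, 52, 55, 57, 59, 61, 63, 65, 67, 69,
          71, 73, 75, 77, 79, 81, 83, 85, 87, 89]

# n=4 returns the three cut points Q1, Q2 (median), Q3.
quartiles = quantiles(scores, n=4)
print(quartiles)  # [59.5, 70.0, 80.5]

# n=10 returns the nine decile cut points in the same way.
deciles = quantiles(scores, n=10)
print(len(deciles))  # 9
```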
Variance and Standard Deviation
Variance (s^2):
Definition: Measures how far data points are spread around the mean. The sample variance is the sum of squared deviations from the mean divided by \(n - 1\), which is approximately the average squared deviation.
Formula: \( s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \)
Units: Square of the original data's units, which can make interpretation difficult.
Standard Deviation (s):
Definition: Square root of variance; measures spread in the same units as the data, making it more interpretable than variance.
Formula: \( s = \sqrt{s^2} \)
Calculating Variance & Standard Deviation
Steps to compute Sample Standard Deviation (s):
1) Find the Sample Mean: (\(\bar{x}\))
2) Compute Each Deviation: (\(x_i - \bar{x}\))
3) Square Each Deviation
4) Sum the Squared Deviations
5) Divide by (\(n - 1\)) (for sample variance)
6) Take the Square Root (to get standard deviation)
6) Take the Square Root (to get standard deviation)
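Worked example: The six steps above can be sketched in Python, here applied to the five scores from the median example (80, 85, 90, 95, 100). The function name `sample_sd` is hypothetical. In practice `statistics.stdev` gives the same result.

```python
from math import sqrt


def sample_sd(values):
    """Sample standard deviation, following the six steps one at a time."""
    n = len(values)
    mean = sum(values) / n                    # 1) sample mean
    deviations = [x - mean for x in values]   # 2) each deviation
    squared = [dev ** 2 for dev in deviations]  # 3) square each deviation
    total = sum(squared)                      # 4) sum the squared deviations
    variance = total / (n - 1)                # 5) divide by n - 1
    return sqrt(variance)                     # 6) take the square root


scores = [80, 85, 90, 95, 100]
# Mean = 90; squared deviations = 100, 25, 0, 25, 100; sum = 250; variance = 62.5
print(round(sample_sd(scores), 2))  # 7.91
```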
Visualizing Data
Box-and-Whisker Plot (Box Plot):
Description: Shows distribution via a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum.
Purpose: Useful for identifying the shape of the distribution, skewness, and potential outliers (e.g., Box Plot of Cholesterol Levels).
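Worked example: A minimal Python sketch of the five-number summary behind a box plot. The cholesterol values are hypothetical (the notes do not give the underlying data). Flagging outliers with the 1.5 × IQR rule is a common convention assumed here, not something stated in these notes. Quartiles use the default method of `statistics.quantiles`, and the function names are hypothetical.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    q1, med, q3 = quantiles(ordered, n=4)
    return ordered[0], q1, med, q3, ordered[-1]


def iqr_outliers(values):
    """Values beyond 1.5 x IQR from the quartiles (assumed convention)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]


# Hypothetical cholesterol levels (mg/dL), for illustration only.
cholesterol = [150, 165, 172, 180, 185, 190, 195, 200, 210, 225, 310]
print(five_number_summary(cholesterol))  # (150, 172.0, 190.0, 210.0, 310)
print(iqr_outliers(cholesterol))         # [310]
```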
Descriptive Stats: Examples from the Transcript
Ex 2.1 (Cancer, Nutrition):
Objective: Describe vitamin A consumption among cases (cancer patients) and controls before formal analysis.
Setup: 200 hospitalized cancer patients (cases) and 200 matched controls.
Matching: Controls matched on age and sex; controls hospitalized for an unrelated disease.
Ex 2.2 (Pulmonary Disease):
Question: Do passive smokers have impaired pulmonary function compared with nonsmokers in smoky environments?
Context: San Diego study, 1980.
Summary: Passive smokers had lower pulmonary function compared with comparable nonsmokers in smoky environments.
Ex 2.3 (Graphs):
Explanation: A scatter plot for CO concentrations in two working environments shows divergence during the day and convergence after the workday, illustrating relationships between variables over time.
Practical Data Visualization / Examples
Explanation: Descriptive statistics in case-control settings require describing the data first (Ex 2.1). Visuals such as bar graphs, histograms, scatter plots, and box plots aid interpretation of distributions and relationships (Ex 2.3).
Box Plot Example
Summary: Box plots summarize Cholesterol Levels with a five-number summary (minimum, Q1, median, Q3, maximum) and visually identify potential outliers.