Bio207/208

1. Introduction to Biostatistics

Biostatistics is the application of statistical methods to the design, analysis, and interpretation of data in biology, medicine, and public health. It provides the tools necessary to manage the uncertainty and variability inherent in biological systems.

2. Descriptive Statistics

Descriptive statistics are used to summarize and describe the primary features of a dataset.

Measures of Central Tendency:
- Mean ( $\bar{x}$ ): The mathematical average of a set of values, calculated as $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ .
- Median: The middle value in a dataset when ordered from least to greatest.
- Mode: The most frequently occurring value in the dataset.
Measures of Dispersion:
- Variance ( $\sigma^2$ ): The average squared deviation of the values from the mean.
- Standard Deviation ( $\sigma$ ): The square root of the variance ( $\sigma = \sqrt{\sigma^2}$ ), indicating the typical distance of observations from the mean, in the same units as the data.
- Interquartile Range (IQR): The difference between the 75th percentile ( $Q3$ ) and the 25th percentile ( $Q1$ ).
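
As a quick illustration, the measures above can be computed with Python's standard `statistics` module. This is a minimal sketch: the blood pressure readings are made-up values, and the quartile method (`"inclusive"`) is one of several conventions, so other software may report slightly different values for $Q1$ and $Q3$.

```python
import statistics

# Hypothetical sample: systolic blood pressure readings (mmHg) for 8 patients.
values = [118, 122, 122, 125, 130, 134, 141, 150]

# Measures of central tendency.
mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)

# Population variance and standard deviation (divide by n),
# matching the sigma notation used in these notes.
variance = statistics.pvariance(values)
std_dev = statistics.pstdev(values)

# quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

print(f"mean={mean}, median={median}, mode={mode}")
print(f"variance={variance:.2f}, sd={std_dev:.2f}, IQR={iqr}")
```

For a sample rather than a whole population, `statistics.variance` and `statistics.stdev` divide by $n - 1$ instead of $n$.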

3. Probability Distributions

Probability distributions describe how the values of a variable are distributed.

Normal Distribution: A symmetric, bell-shaped distribution defined by the mean ( $\mu$ ) and standard deviation ( $\sigma$ ).
Standard Normal Distribution: A special case of the normal distribution where $\mu = 0$ and $\sigma = 1$ . It uses Z-scores to represent the number of standard deviations an observation is from the mean:
$Z = \frac{x - \mu}{\sigma}$
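
The Z-score formula can be tried out in a few lines of Python. The sketch below assumes a hypothetical population of adult heights with $\mu = 170$ cm and $\sigma = 10$ cm; the helper name `z_score` is ours, and `statistics.NormalDist` supplies the standard normal cumulative probability.

```python
from statistics import NormalDist


def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma


# Hypothetical values: a 185 cm individual in a population with
# mean 170 cm and standard deviation 10 cm.
z = z_score(185, mu=170, sigma=10)

# Proportion of the standard normal distribution lying below z.
p_below = NormalDist().cdf(z)

print(f"Z = {z}, proportion below = {p_below:.4f}")
```

Here $Z = 1.5$, so the observation sits one and a half standard deviations above the mean, and roughly 93% of the distribution falls below it.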
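Binomial Distribution: A discrete distribution for the number of successes in a fixed number of independent trials, where each trial has only two possible outcomes (e.g., success/failure, dead/alive) and the same probability of success.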
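
To make the binomial distribution concrete, the sketch below computes the probability of exactly $k$ successes in $n$ independent trials using the standard formula $\binom{n}{k} p^k (1-p)^{n-k}$. The symbols $n$, $k$, and $p$, the function name `binomial_pmf`, and the scenario (10 patients, each with a 20% chance of responding) are assumptions introduced for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Hypothetical scenario: 10 patients, each with a 20% chance of responding.
n, p = 10, 0.2
probs = [binomial_pmf(k, n, p) for k in range(n + 1)]

for k, prob in enumerate(probs):
    print(f"P(X = {k}) = {prob:.4f}")
```

The probabilities over all possible values of $k$ (0 through $n$) sum to 1, which is a useful sanity check.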

4. Inferential Statistics

Inferential statistics allow researchers to make generalizations about a population based on a sample.

Hypothesis Testing:
- Null Hypothesis ( $H_0$ ): The statement that there is no effect or no difference.
- Alternative Hypothesis ( $H_a$ ): The statement that there is an effect or difference.
P-Value: The probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true. A threshold of p < 0.05 is commonly used for statistical significance.
Confidence Intervals (CI): A range of values that likely contains the true population parameter. A 95% CI means that if the study were repeated many times, 95% of the calculated intervals would contain the true parameter.
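
The sketch below ties these ideas together with a two-sided one-sample Z-test, which is one simple way to obtain a p-value and a 95% CI. It assumes the population standard deviation is known (in practice a t-test is more common when it is not), and the sample values, the null value `mu0`, and `sigma` are all made up for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean

# Hypothetical sample of 16 systolic blood pressure readings (mmHg).
sample = [118, 131, 125, 140, 122, 129, 135, 117,
          128, 133, 121, 126, 130, 119, 124, 118]

mu0 = 120    # H0: the population mean is 120 (assumed null value)
sigma = 10   # population standard deviation, assumed known

n = len(sample)
xbar = mean(sample)
se = sigma / sqrt(n)  # standard error of the mean

# Test statistic and two-sided p-value under H0.
z = (xbar - mu0) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# 95% confidence interval for the population mean.
z_crit = NormalDist().inv_cdf(0.975)
ci = (xbar - z_crit * se, xbar + z_crit * se)

print(f"z = {z:.2f}, p = {p_value:.4f}")
print(f"95% CI: ({ci[0]:.1f}, {ci[1]:.1f})")
```

With these numbers the p-value is below 0.05 and the 95% CI excludes 120. For this test the two criteria always agree.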

5. Common Study Designs

Observational Studies:
- Cross-sectional: Analyzes data from a population at a single point in time.
- Case-Control: Compares individuals with a specific condition (cases) to those without (controls) to find risk factors.
- Cohort: Follows a group of people over time to see how exposures affect the development of outcomes.
Experimental Studies:
- Randomized Controlled Trials (RCTs): Participants are randomly assigned to either an experimental or control group to evaluate the efficacy of a treatment.
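
Random assignment in an RCT can be sketched as a simple shuffle. This is only an illustration of simple randomization: the participant IDs, the function name `randomize`, and the fixed seed are hypothetical, and real trials often use blocked or stratified schemes instead.

```python
import random


def randomize(participants, seed=None):
    """Randomly split participants into two equal-sized arms
    (the control arm gets the extra person if the count is odd)."""
    rng = random.Random(seed)      # seeded so the allocation is reproducible
    shuffled = list(participants)  # copy, leaving the input list unchanged
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}


# Hypothetical participant IDs P01..P20.
participants = [f"P{i:02d}" for i in range(1, 21)]
groups = randomize(participants, seed=42)

print("Treatment:", groups["treatment"])
print("Control:  ", groups["control"])
```

Because assignment depends only on chance, known and unknown confounders tend to be balanced between the two groups.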

6. Correlation and Regression

Correlation Coefficient ( $r$ ): Measures the strength and direction of a linear relationship between two continuous variables, ranging from $-1$ to $1$ .
Simple Linear Regression: A model that describes the relationship between a dependent variable ( $Y$ ) and an independent variable ( $X$ ): $Y = \beta_0 + \beta_1 X + \epsilon$
- $\beta_0$ : Intercept
- $\beta_1$ : Slope (the expected change in $Y$ for every one-unit increase in $X$ )
- $\epsilon$ : Error term
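
Both quantities can be computed by hand from sums of squares and cross-products. The sketch below uses ordinary least squares to estimate $\beta_0$ and $\beta_1$, which is the usual fitting method but is an assumption here since these notes do not name one. The function names and the dose/response data are hypothetical.

```python
from math import sqrt


def _sums(xs, ys):
    """Return mean of x, mean of y, and the sums Sxx, Syy, Sxy."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return mx, my, sxx, syy, sxy


def pearson_r(xs, ys):
    """Pearson correlation coefficient r."""
    _, _, sxx, syy, sxy = _sums(xs, ys)
    return sxy / sqrt(sxx * syy)


def fit_line(xs, ys):
    """Least-squares estimates (b0, b1) of the intercept and slope."""
    mx, my, sxx, _, sxy = _sums(xs, ys)
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return b0, b1


# Hypothetical data: dose (X) and measured response (Y).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(xs, ys)
b0, b1 = fit_line(xs, ys)
print(f"r = {r:.4f}, intercept = {b0:.2f}, slope = {b1:.2f}")
```

The sign of the estimated slope always matches the sign of $r$, since both share the same numerator.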