Biostatistics and Epidemiology Notes

Biostatistics

Statistics is a branch of mathematics dealing with the collection, organization, presentation, analysis, and interpretation of data.
Two main areas:
- Mathematical statistics: Development of new statistical inference methods, requiring detailed knowledge of abstract mathematics.
- Applied statistics: Application of mathematical statistics methods to various subjects (e.g., economics, psychology, public health).

Branches of Statistics

Descriptive statistics: Methods for summarizing and presenting data in an easier-to-analyze and interpret form.
- Includes measures of central tendency (mean, median, mode) and measures of variability (range, variance, standard deviation).
Inferential statistics: Generalizing from samples to populations, performing estimations and hypothesis tests, and making predictions.
- Involves techniques like hypothesis testing (Z-test, t-test, ANOVA, Wilcoxon Signed Rank Test, Mann-Whitney U Test) and regression analysis (linear, logistic, multinomial, ordinal).
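
As a small illustration of the descriptive measures named above (mean, median, mode, range, variance, standard deviation), the sketch below uses Python's standard `statistics` module. The list of hospital stays and all variable names are made up for illustration and do not come from these notes.

```python
import statistics

# Hypothetical hospital stays in days (illustrative data only).
stays = [3, 5, 4, 7, 5, 2, 6, 5, 4, 9]

# Measures of central tendency
mean_stay = statistics.mean(stays)      # arithmetic mean
median_stay = statistics.median(stays)  # middle value of the sorted data
mode_stay = statistics.mode(stays)      # most frequent value

# Measures of variability
range_stay = max(stays) - min(stays)    # largest minus smallest value
var_stay = statistics.variance(stays)   # sample variance (divides by n - 1)
sd_stay = statistics.stdev(stays)       # sample standard deviation

print(mean_stay, median_stay, mode_stay, range_stay, var_stay, sd_stay)
```

Note that `statistics.variance` and `statistics.stdev` treat the data as a sample; `statistics.pvariance` and `statistics.pstdev` are the population versions.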

Examples

Descriptive: "In 2011, there were 34 deaths from the avian flu." (Source: WHO)
Inferential: "In 2025, the world population is predicted to be 8 billion people." (Source: UN)
Inferential: "Based on a sample of 2739 respondents, it is estimated that all pet owners spent a total of 14 billion dollars on veterinarian care for their pets." (Source: American Pet Products Association, Pet Owners Survey)
Descriptive: "The median cost of 2739 respondents on veterinarian care for their pets is 12.5 billion dollars."
Inferential: "Scientists at the University of Oxford in England found that a good laugh significantly raises a person’s pain level tolerance."
Inferential: "Drinking decaffeinated coffee can raise cholesterol levels by 7%" (Source: American Heart Association)
Descriptive: "The average stay in a hospital for 2000 patients who had circulatory system problems was 4.7 days."
Descriptive: "In a poll of 3036 adults, 32% said that they got a flu shot at a retail clinic" (Source: Harris Interactive Poll)
Descriptive: "In the Philippines, there were 894 COVID cases reported on Sep 4-10, 15.0% higher than the 780 cases reported on Aug 28-Sep 3. (WHO)"
Inferential: "It is projected that the average total number of adults and children receiving ARV treatment will be 31.7 million people by 2024. (aidsdatahub.org)"

Classification of Statistics

Parametric statistics: Assumes random sample from a normal distribution and involves testing of hypotheses about the population mean.
- Assumptions include normality, homoscedasticity (equal variances), interval or ratio level measurement, and absence of outliers.
Nonparametric statistics: Does not assume any underlying data distribution and involves hypothesis testing about a population median.
- Generally less powerful than parametric tests when the parametric assumptions hold, so they may require somewhat larger sample sizes.

Parametric vs. Non-Parametric Tests

Parametric Tests
- One Sample: z test, t test.
- Two Sample:
  - Independent Sample: t test or z test.
  - Paired Sample: paired t test.
Non-Parametric Tests
- One Sample: Chi-Square, K-S, Runs test, Binomial
- Two Sample:
  - Independent Sample: Chi-Square, Mann-Whitney, Median, K-S
  - Paired Sample: Sign, Wilcoxon, McNemar, Chi-square

Epidemiology

Epidemiology: Study of the distribution and determinants of health-related conditions in human populations and the application of this study to the control of health problems.
- Originally dealt with infectious diseases but now includes chronic diseases and health-related conditions (e.g., heart attacks, car accidents, autism, arthritis).
- Distribution: Refers to time, place, and types of persons affected.
- Determinants: Physical, biological, social, cultural, and behavioral factors that influence health.

Terminologies in Biostatistics

Variable: A characteristic or attribute that can assume different values (e.g., height, weight, BMI, blood pressure).
Data: Values (measurements or observations) that the variable can assume.
Data set: A collection of data values; each value is a datum or data point.
Population: All subjects (human or otherwise) being studied.
Sample: A group of subjects selected from a population.
Census: The process of collecting data from every subject in the population.
- In the Philippines, the Philippine Statistics Authority is responsible for national censuses and surveys.
Parameter: A characteristic or measure obtained by using all values from a specific population.
- Example: average income of all nurses in the Philippines
Statistic: A characteristic or measure obtained by using data values from a sample.
- Example: average income of a sample of 15 nurses
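
The difference between a parameter and a statistic can be sketched in a few lines of Python. The population below is simulated (hypothetical monthly incomes for 1,000 nurses), so the figures are purely illustrative; only the sample size of 15 comes from the example above.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical monthly incomes (in pesos) for a population of 1,000 nurses.
population = [random.randint(20000, 60000) for _ in range(1000)]

parameter = statistics.mean(population)  # uses every value in the population
sample = random.sample(population, 15)   # 15 nurses drawn without replacement
statistic = statistics.mean(sample)      # uses only the sampled values

print(f"Parameter (population mean): {parameter:.2f}")
print(f"Statistic (sample mean, n = 15): {statistic:.2f}")
```

The statistic changes from sample to sample, while the parameter is fixed for a given population.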

Measurements

Four levels of measurement (lowest to highest): nominal, ordinal, interval, and ratio.
- Nominal: Classifies data into mutually exclusive categories with no order or ranking (e.g., name, gender, zip code, student number, marital status).
- Ordinal: Classifies data into categories that can be ranked, but precise differences do not exist (e.g., rankings, performance evaluations).
- Interval: Ranks data with precise differences between units, but no meaningful zero exists (e.g., IQ score, temperature).
- Ratio: Possesses all characteristics of interval measurement and has a true zero (e.g., height, weight).

Levels of Measurement Table

Level	Mutually Exclusive Categories	Can be Ordered	Meaningful Differences	Can be Added/Subtracted	Meaningful Zero	Can be Multiplied/Divided
Nominal	✓
Ordinal	✓	✓
Interval	✓	✓	✓	✓
Ratio	✓	✓	✓	✓	✓	✓

Examples

Grade description (Passed, Failed, INC, UW) - Nominal (N)
Letter Grades (A, B, C, D, E, F) - Ordinal (O)
General Weighted Average (40.00-100.00) - Interval (I)
Contact Numbers - Nominal (N)
Time of the day in 24H format (00:00:00-23:59:59) - Interval (I)
Age group (infant, children, adolescent, adult) - Ordinal (O)
Age in years (0, 1, 2, 3, …) - Ratio (R)
Time in seconds to finish a task - Ratio (R)

Variables

Experimental Variables

Response variable (dependent variable): Affected by the value of some other variable.
Explanatory variable (independent variable): Thought to affect the values of the response variable.

Examples

Research Question: "Do tomatoes grow fastest under fluorescent, incandescent, or natural light?"
- Independent: Type of light the tomato plant is grown under
- Dependent: The rate of growth of the tomato plant
Research Question: "What is the effect of intermittent fasting on blood sugar levels?"
- Independent: Presence or absence of intermittent fasting
- Dependent: Blood sugar levels

Correlation Studies

Predictor variable: Same as the independent variable.
Criterion variable: Same as the dependent variable.
Correlational research does not identify independent and dependent variables, as analysis is not dependent on the direction of the relationship.

Examples

Research Question: "An economist wants to determine whether there is a linear relationship between a country’s gross domestic product (GDP) and carbon dioxide (CO2) emissions."
- Predictor: Gross Domestic Product
- Criterion: Carbon dioxide (CO2) emissions
Research Question: "A director of alumni affairs at a small college wants to determine whether there is a linear relationship between the number of years alumni have been out of school and their annual contributions (in thousands of dollars)."
- Predictor: The number of years out of school
- Criterion: The annual contributions (in thousands of dollars)

Qualitative and Quantitative Variables

Qualitative variables: Have distinct categories based on some characteristic or attribute (e.g., gender, religious preference, location).
Quantitative variables: Can be counted (discrete) or measured (continuous) (e.g., age, height, weight, body temperature).

Discrete and Continuous Variables

Discrete variables: Assume values that can be counted (e.g., number of children in a family, number of students in a classroom, number of calls received by the emergency department).
Continuous variables: Assume an infinite number of values between any two specific values and are obtained by measuring (e.g., body temperature, height, weight).

Examples

Grade description (Passed, Failed, INC, UW) - Qualitative (QL)
Letter Grades (A, B, C, D, E, F) - Qualitative (QL)
General Weighted Average (0.00-100.00) - Quantitative Continuous (QTC)
Age in years (0, 1, 2, 3, …) - Quantitative Discrete (QTD)
Time of the day in 24H format (00:00:00-23:59:59) - Quantitative Continuous (QTC)
Age group (infant, children, adolescent, adult) - Qualitative (QL)
Time in seconds to finish a task - Quantitative Continuous (QTC)

Summation Notation

$\sum_{i=1}^{n} X_i$
- $\sum$ : Greek uppercase letter sigma; used to denote sum
- $X_i$ : argument; expression to be added
- $i = 1$ : subscript; lower limit; initial term of $X$ in the summation
- $n$ : upper limit; final term of $X$ in the summation

Rules in Summation Notation

Rule 1: The sum of a constant from 1 to $n$ equals the product of the constant and $n$ . $\sum_{i=1}^{n} c = nc$
- Example: Evaluate $\sum_{i=1}^{9} 4$ . $\sum_{i=1}^{9} 4 = 9 \times 4 = 36$
Rule 2: The sum of a variable plus or minus a constant equals the sum of the variable plus or minus the product of the constant and $n$ . $\sum_{i=1}^{n} (X_i \pm c) = \sum_{i=1}^{n} X_i \pm nc$
- Example (sum): Let $X_1 = 3, X_2 = 5, X_3 = 2, X_4 = 6, X_5 = 3$ . Evaluate $\sum_{i=1}^{5} (X_i + 7)$ . $\sum_{i=1}^{5} (X_i + 7) = (3+5+2+6+3) + (5 \times 7) = 19 + 35 = 54$
- Example (difference): Let $X_1 = 3, X_2 = 5, X_3 = 2, X_4 = 6, X_5 = 3$ . Evaluate $\sum_{i=1}^{5} (X_i - 3)$ . $\sum_{i=1}^{5} (X_i - 3) = (3+5+2+6+3) - (5 \times 3) = 19 - 15 = 4$
Rule 3: The sum of the product of a variable and a constant equals the product of the constant and the sum of the variable. $\sum_{i=1}^{n} cX_i = c \sum_{i=1}^{n} X_i$
- Example: Let $X_1 = 3, X_2 = 5, X_3 = 2, X_4 = 6, X_5 = 3$ . Evaluate $\sum_{i=1}^{5} 7X_i$ . $\sum_{i=1}^{5} 7X_i = 7 \times (3+5+2+6+3) = 7 \times 19 = 133$
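
The three rules can be checked numerically with the values from the examples above. This is only a quick sanity check in Python; the variable names are chosen for illustration.

```python
X = [3, 5, 2, 6, 3]  # X_1 ... X_5 from the examples
n = len(X)

# Rule 1: summing a constant c over n terms gives n * c.
rule1 = sum(4 for _ in range(9))

# Rule 2: the sum of (X_i + c) or (X_i - c) equals the sum of X_i plus or minus n * c.
rule2_plus = sum(x + 7 for x in X)
rule2_minus = sum(x - 3 for x in X)

# Rule 3: a constant factor can be moved outside the summation.
rule3 = sum(7 * x for x in X)

# Each direct sum agrees with the shortcut given by its rule.
assert rule1 == 9 * 4
assert rule2_plus == sum(X) + n * 7
assert rule2_minus == sum(X) - n * 3
assert rule3 == 7 * sum(X)

print(rule1, rule2_plus, rule2_minus, rule3)  # 36 54 4 133
```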

Classifications of Data

Internal data: Relates to activities within the organization collecting the data.
- Example: health indicator data for the Department of Health.
External data: Relates to activities outside the organization collecting the data.
- Example: data obtained from computerized databases, books, periodicals, and government documents.
Statistical data: Published data from government institutions, companies, and associations involving figures, tables, and graphs.
Nonstatistical data: Information that does not involve figures, tables, and graphs.

Sources of Data

Observational data: From cross-sectional, retrospective, and prospective studies; collected from naturally occurring situations or for administrative purposes (e.g., medical records, government agencies, surveys).
Experimental data: Derived from planned experiments and clinical trials (e.g., survival data, recovery rates, relapse rates).

Types of Studies

Cross-Sectional Studies: Data are collected at a single point in time (the present).
- Example: Surveys about present health characteristics of a population.
Retrospective Studies: Focus on risk factors or exposure factors in the past.
- Example: Case-control studies, where patients with a disease are asked about prior exposures.
Prospective Studies: Follow subjects from the present into the future.
- Example: Prospective cohort study, follows individuals who are free from disease but have an exposure factor.
Experimental Studies and Quality Control: Includes a study group, a control group, an independent (causal) variable, and a dependent (outcome) variable. Subjects are assigned randomly.
Clinical Trials: Experiment performed by a healthcare organization to evaluate the effect of an intervention or treatment against a control in a clinical environment.

Methods of Data Collection

Data collection: Gathering and measuring information on variables of interest in an established systematic fashion.

Methods

Indirect or Questionnaire method: Gathering information via a series of questions.
- Types of questions:
  - Closed-ended: Respondent picks from given options (Dichotomous, Nominal-polytomous, Ordinal-polytomous, Continuous)
  - Open-ended: Respondent formulates own answer.
Direct or Interview method: Asking questions and getting answers from participants.
Registration method: Governed by laws (e.g., birth/death certificate registration).
Observation method: Data pertaining to behaviors of individuals or groups.
Experiment method: Determining cause-and-effect relationships under controlled conditions.

Examples

PhilHealth region VIA supervisor would like to monitor the work behavior of his subordinates using a hidden camera. - Observation method
A researcher stood outside a vape shop to ask the users about their views on the health effects of vaping. - Interview method
A kinesiologist tested which of three exercise programs (push-pull-legs split, upper-lower, total body), each assigned to a different group of people, is best for hypertrophy. - Experiment method
A health educator would like to study the opinions of the 237 nursing students in Cavite on the new BS Nursing curriculum. - Questionnaire method
The DOH issues and records COVID-19 vaccination certificates for vaccinated people. - Registration method

SAMPLING METHOD

A. Definition of Terms

Population: The entire group of individuals of interest in a study.
- Target Population: The population from which representative information is desired and to which inferences will be made.
- Study Population: A subset of the target population that can be studied.
- Example: If studying pregnant women in Cavite, the target population is all pregnant women in Cavite, while the study population might be pregnant women attending maternity clinics in Cavite.
Sampling Frame: A roster of the sampled population.
Elementary Units: Individuals in the population on which a measurement is taken.
Sampling Units: Units chosen in selecting a sample (e.g., households, schools).

B. Criteria of Sampling Design

Representative of the Population: The sample must properly represent the population.
Reliability: It should be possible to measure the reliability of estimates made from the sample.
Practicable: The sampling design must be simple and straightforward to carry out.
Efficient and Economical: The design should produce the most information at the smallest cost.

C. Methods of Probability Sampling

Census: A count or measure of an entire population.
Sampling: Studying only a portion of the population.
Sample: A representative portion of the population.

Advantages of Sampling

Reduced Cost
Greater Speed
Greater Scope
Greater Accuracy: Less prone to non-sampling errors due to careful supervision and processing.

Probability vs. Non-Probability Sampling

Probability Sampling: Uses random selection, where each unit in the population has a known, nonzero chance of being included (an equal chance in simple random sampling).
Non-Probability Sampling: Does not involve random selection.

Probability Sampling Methods

Simple Random Sampling:
- Selecting a sample of size n from the population using a lottery, a table of random numbers, or statistical software.
- Example: Randomly selecting 15 students from a class of 45.
Systematic Random Sampling:
- Selecting every kth individual from an ordered population, with the first individual selected at random.
- Formula: k = N/n (where k = sampling interval, N = population size, n = sample size)
- Example: Selecting every 40th employee from a list of 1200 employees to get a sample size of 30.
Stratified Random Sampling:
- Dividing the population into non-overlapping strata (homogenous within, heterogenous between).
- Types:
  - Equal Allocation: Equal sample sizes among the strata.
    - Formula: no. of samples per stratum = required sample size / number of strata
    - Example: With 40 freshmen, 30 sophomores, 30 juniors, and 20 seniors, and a desired sample size of 40, using equal allocation, 10 samples are taken from each year level.
  - Proportional Allocation: Sample sizes proportional to the strata sizes.
    - Formula: no. of samples per stratum = (stratum size / total population size) * required sample size
    - Example: Freshmen (40), Sophomores (30), Juniors (30), Seniors (20), Total 120. Required sample size = 40. Freshman sample size = (40/120)*40 = 13.33, rounded to 13.
Cluster Random Sampling:
- Dividing the population into clusters (preferably heterogenous) and randomly selecting clusters.
- Types:
  - One-Stage: All individuals in selected clusters are taken as samples.
  - Two-Stage: Random samples are taken from each selected cluster.
Multistage Sampling:
- Sampling in stages, drawing samples within samples.
- Example: Randomly selecting provinces, then municipalities within provinces, then barangays within municipalities, and finally households within barangays.
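
The systematic and stratified designs described above can be sketched in Python using the numbers from the examples (1200 employees with n = 30; year levels of 40, 30, 30, and 20 with n = 40). The function names are hypothetical, and rounding the proportional shares to the nearest whole number is an assumption; the notes do not state a rounding rule.

```python
import random

def systematic_sample(units, n, rng):
    """Take every k-th unit after a random start, with k = N // n."""
    k = len(units) // n        # sampling interval k = N / n
    start = rng.randrange(k)   # random start among the first k units
    return units[start::k][:n]

def equal_allocation(strata_sizes, n):
    """Give each stratum the same share of the required sample size."""
    return {name: n // len(strata_sizes) for name in strata_sizes}

def proportional_allocation(strata_sizes, n):
    """Give each stratum a share proportional to its size (rounded)."""
    total = sum(strata_sizes.values())
    # Rounded shares may not always add up to exactly n.
    return {name: round(size / total * n) for name, size in strata_sizes.items()}

rng = random.Random(0)              # seeded generator for repeatability
employees = list(range(1, 1201))    # employee numbers 1..1200
sample = systematic_sample(employees, 30, rng)

year_levels = {"freshmen": 40, "sophomores": 30, "juniors": 30, "seniors": 20}
equal = equal_allocation(year_levels, 40)
proportional = proportional_allocation(year_levels, 40)

print(sample[:3], len(sample))
print(equal)
print(proportional)
```

With these inputs, equal allocation gives 10 per year level and proportional allocation gives 13, 10, 10, and 7.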

D. Non-Probability Sampling Methods

Convenience Sampling: Selecting readily available individuals.
- Example: Interviewing vaccinated patients outside an animal bite center.
Purposive Sampling: Selecting based on judgment and prior information.
Modal Instance Sampling: Also known as homogenous sampling. It aims to get a sample of people who have similar or identical traits.
- Example: Studying Filipino nurses in Saudi Arabia.
Heterogenous Sampling: Also known as maximum variation sampling.
- Selecting candidates across a broad spectrum relating to the topic of study.
- Example: Selecting Filipinos from different faith backgrounds to get the perspectives of religious Filipinos on pro-life advocacies
Expert Sampling: Selecting based on demonstrable expertise.
- Example: Interviewing experts in Ayurvedic and Filipino Folk Medicine.
Snowball Sampling: Used when the population is hard to reach; participants recruit other participants.
- Example: Researching recreational marijuana users.
Quota Sampling: Non-random selection of a predetermined number or proportion of units.
- Types:
  - Proportional: Quota reflects population proportions. Example: If a company has 600 drivers and 400 train-riders, a sample of 100 should have 60 drivers and 40 train-riders.
  - Non-Proportional: Quota is determined in advance. Example: Drawing a sample of 100 people with a quota of 50 people under 40 and a quota of 50 people over 40.

Glossary

Population: The entire group of individuals of interest.
Sample: A subset of the population used for study.
Strata: Non-overlapping subgroups within a population (used in stratified sampling).
Cluster: A grouping of population elements (used in cluster sampling).
Sampling Frame: A list of all elements in the population from which the sample is drawn.
Random Selection: Selection process where each element has an equal chance of being chosen.

SAMPLE SIZE DETERMINATION

A. Sample Size

Definition: The number of observations in a sample (n), crucial for accurate standard error estimation. N represents population size.
Importance: Estimates the number of subjects needed to detect an association, considering Type I (false positive) and Type II (false negative) errors.
Critical Components (Miaoulis & Michener, 1976):
1. Level of precision
2. Level of confidence or risk
3. Degree of variability in attributes measured

Level of Statistical Precision/Sampling Error

Definition: Closeness between calculated and population values. Estimated by standard error.
Descriptive Estimation: Difference between sample estimate and population parameters.
- Formula:
Inferential Estimation: Used to estimate the significant difference between or among parameter estimates.
- Formula:

Confidence / Risk Level

Definition: The degree to which an assumption or number is likely to be true.
Interpretation: Statistical measure of how often results fall within a specified range (e.g., 95% confidence means 95 out of 100 times).

Degree of Variability

Impact:
- Heterogeneous population: Requires larger sample size.
- Homogeneous population: Smaller sample size is sufficient.

B. Methods of Determining Sample Size

Four Methods (Cochran, Gupta & Kapoor, Israel et al.):
1. Census method (small populations)
2. Replicate sample size from similar studies
3. Sample size from published tables
4. Applying formulas to calculate sample size

1. Census Method

Application: Suitable for very small populations.
Advantage: High accuracy and precision.
Disadvantage: Costly for large populations.

2. Replicate a Sample Size

Application: When research is in a similar field with available literature.
Disadvantage: Potential to repeat errors from previous studies.

3. Sample Size from Published Tables

Method: Using pre-defined tables based on specified criteria.

4. Applying Formulas

Advantage: Customizable based on research type and precision.
Common Formulas:
1. Estimating the mean or average
2. Estimating proportion (Infinite population)
3. Yamane (Simplified form of Proportions for finite population)
4. Infinite Population Correction
5. Finite Population Correction

1. Estimating the Mean or Average

Formula: $n = \left(\frac{Z\sigma}{E}\right)^2$ (Z = critical value for the confidence level, σ = population standard deviation, E = margin of error)
Estimating Sigma (σ) when unknown: Conduct a preliminary survey or use results from previous studies.
Note: Sample size should be at least 30.
Example: A health care professional wishes to estimate the mean birth weight of infants. How large a sample must be obtained if she desires to be 90% confident that the true mean is within 2 ounces of the sample mean? Assume that the population standard deviation is 8 ounces.
- Given: Z = 1.65 (90% confidence level), E = 2, and σ = 8
- Formula: $n = \left(\frac{Z\sigma}{E}\right)^2$
- Solution: $n = \left(\frac{1.65 \times 8}{2}\right)^2 = (6.6)^2 = 43.56$ , rounded up to 44
- Answer: A sample of 44 infants must be obtained.
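
The birth-weight example can be reproduced with a short Python sketch. It assumes the standard formula for estimating a mean, n = (Zσ/E)², which matches the stated answer of 44; the function name is hypothetical.

```python
import math

def sample_size_for_mean(z, sigma, e):
    """Return n = (z * sigma / e)^2, rounded up to a whole subject."""
    return math.ceil((z * sigma / e) ** 2)

# Values from the example: 90% confidence (Z = 1.65), sigma = 8 oz, E = 2 oz.
n_infants = sample_size_for_mean(z=1.65, sigma=8, e=2)
print(n_infants)  # 44
```

Rounding up rather than to the nearest integer keeps the margin of error at or below E.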

2. Estimating Proportion (Infinite Population)

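Formula: $n = pq\left(\frac{Z}{E}\right)^2$ (p = estimated proportion, q = 1 - p, Z = critical value for the confidence level, E = margin of error)
If p is unknown: Use p = 0.5 for maximum sample size.
Example: A federal report indicated that 27% of children ages 2 to 5 years had a good diet. How large a sample is needed to estimate the true proportion of children with good diets within 2% with 95% confidence?
- Given: Z = 1.96 (95% confidence level), E = 0.02, p = 0.27, and q = 1 - 0.27 = 0.73
- Formula: $n = pq\left(\frac{Z}{E}\right)^2$
- Solution: $n = (0.27)(0.73)\left(\frac{1.96}{0.02}\right)^2 = 0.1971 \times 9604 = 1892.95$ , rounded up to 1,893
- Answer: A sample of 1,893 children is needed.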
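
The same calculation can be sketched in Python. It assumes the standard formula for estimating a proportion, n = pq(Z/E)², which matches the stated answer of 1,893; the function name is hypothetical.

```python
import math

def sample_size_for_proportion(z, p, e):
    """Return n = p * q * (z / e)^2 with q = 1 - p, rounded up."""
    q = 1 - p
    return math.ceil(p * q * (z / e) ** 2)

# Values from the example: 95% confidence (Z = 1.96), p = 0.27, E = 0.02.
n_children = sample_size_for_proportion(z=1.96, p=0.27, e=0.02)
print(n_children)  # 1893
```

Because the product pq is largest at p = 0.5, using p = 0.5 when p is unknown gives the most conservative (largest) sample size for a given Z and E.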

3. Yamane's Formula (Finite Population)

Formula: $n = \frac{N}{1 + Ne^2}$ (N = population size, e = level of precision)
Example: A researcher plans to conduct a survey about the food preferences of BS Nursing students. If the population of students is 500, find the sample size if the error is 5%.
- Given: N = 500 and e = 0.05
- Formula: $n = \frac{N}{1 + Ne^2}$
- Solution: $n = \frac{500}{1 + 500(0.05)^2} = \frac{500}{2.25} = 222.22\ldots$ , rounded up to 223
- Answer: The researcher needs 223 BS Nursing students for the study.
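
Yamane's formula is simple enough to check in a couple of lines of Python. The function name is hypothetical, and the result is rounded up to a whole respondent, as in the worked example.

```python
import math

def yamane_sample_size(population_size, e):
    """Return n = N / (1 + N * e^2), rounded up to a whole respondent."""
    return math.ceil(population_size / (1 + population_size * e ** 2))

# Values from the example: N = 500 students, e = 0.05.
n_students = yamane_sample_size(500, 0.05)
print(n_students)  # 223
```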

4. Infinite Population Correction

Formula:

5. Finite Population Correction

Formula:

C. Power Analysis

Purpose: Determines the minimum sample size needed.
Factors: Power, effect size, significance level.
Power (1-β): Ability to correctly reject a false null hypothesis. Adequate level: 80% or higher.
Effect Size: Magnitude of the effect of independent variables on the dependent variable. Cohen's (1988) benchmarks for $f^2$ in multiple regression:
- Small: 0.02
- Medium: 0.15
- Large: 0.35
Significance Level (α): Probability of rejecting the null hypothesis when it is true.
- Social and Behavioral Science: Generally 0.05 (5%)
- Medical Research: Generally 0.01 (1%)
- Confidence Level = 1 - α
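
The relation Confidence Level = 1 - α also explains where Z values such as 1.65 (90%) and 1.96 (95%) come from. The Python sketch below assumes a two-sided interval, so α is split evenly between the two tails; it uses `NormalDist` from the standard library, and the function name is hypothetical.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Two-sided critical value z with P(-z < Z < z) = confidence."""
    alpha = 1 - confidence              # significance level
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence -> z = {z_critical(level):.3f}")
```

This gives approximately 1.645, 1.960, and 2.576 for 90%, 95%, and 99% confidence.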

D. G*Power Calculator

Description: Software for calculating sample size and power for various statistical tests.
Steps:
1. Establish research goals and hypotheses.
2. Choose appropriate statistical tests.
 - Distribution-based approach (exact, F, t, χ², z tests)
 - Design-based approach
3. Choose a power analysis method.
4. Input required variables and calculate.

Power Analysis Methods in G*Power

A Priori Analysis

Purpose: Sample size calculation performed before the study. Used to determine the sample size N.

Post-Hoc Analysis

Purpose: Conducted after study completion to calculate power level (1-β).
Limitation: Only controls α, not β.

E. G*Power Calculator for Survey Research

Uses three models: Simple, Mediation, and Moderation.

Steps for G*Power:

Choose "F tests" from the test family.
Select "Linear multiple regression: fixed model, R² deviation from zero".
Set power analysis to "A priori".
Specify effect size (0.15 is medium effect), α (0.05), and power (0.80). These are common settings for social and business science research.
Enter the number of predictors (maximum arrows pointing to a dependent variable).
Click Calculate.

Example (Simple Model): Three predictors require a minimum sample size of 77.
Example (Mediation Model): Four predictors require a minimum sample size of 85.
Example (Moderation Model): Seven predictors require a minimum sample size of 103.

F. G*Power Calculator for Experimental Research

Choose "t tests" or "F tests" in the test family.
Select the appropriate statistical test (e.g., "Means: Difference between two dependent means (matched pairs)" for paired samples t-test).
Input effect size (e.g., 0.5 for medium effect), power, and alpha levels.
Use alpha = 0.01 as a safe bet, unless previous research indicates that a lower value is needed.

Glossary

Sample Size (n): The number of observations included in a sample.
Population Size (N): The total number of individuals in a population.
Standard Error: A measure of the statistical accuracy of an estimate.
Type I Error (False Positive): Rejecting a true null hypothesis.
Type II Error (False Negative): Failing to reject a false null hypothesis.
Level of Precision (E): The acceptable margin of error in an estimate.
Level of Confidence: The probability that an estimate falls within a certain range.
Degree of Variability: The extent to which data points in a population differ from each other.
Power (1-β): The probability of correctly rejecting a false null hypothesis.
Effect Size: The magnitude of the effect of an independent variable on a dependent variable.
Significance Level (α): The probability of rejecting the null hypothesis when it is true.