Biostatistics and Epidemiology Notes
Biostatistics
Statistics is a branch of mathematics dealing with the collection, organization, presentation, analysis, and interpretation of data.
Two main areas:
Mathematical statistics: Development of new statistical inference methods, requiring detailed knowledge of abstract mathematics.
Applied statistics: Application of mathematical statistics methods to various subjects (e.g., economics, psychology, public health).
Branches of Statistics
Descriptive statistics: Methods for summarizing and presenting data in an easier-to-analyze and interpret form.
Includes measures of central tendency (mean, median, mode) and measures of variability (range, variance, standard deviation).
Inferential statistics: Generalizing from samples to populations, performing estimations and hypothesis tests, and making predictions.
Involves techniques such as hypothesis testing (Z-test, T-test, ANOVA, Wilcoxon Signed Rank Test, Mann-Whitney U Test) and regression analysis (linear, nominal, logistic, ordinal).
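The descriptive measures named above can be computed with Python's standard `statistics` module. This is a minimal sketch; the data values are hypothetical and serve only to illustrate the calls.

```python
import statistics

# Hypothetical lengths of hospital stay in days (illustrative values only).
stays = [3, 5, 4, 7, 4, 6, 4, 5]

# Measures of central tendency
mean_stay = statistics.mean(stays)      # arithmetic mean
median_stay = statistics.median(stays)  # middle value of the sorted data
mode_stay = statistics.mode(stays)      # most frequent value

# Measures of variability
range_stay = max(stays) - min(stays)        # largest minus smallest value
variance_stay = statistics.variance(stays)  # sample variance (n - 1 denominator)
sd_stay = statistics.stdev(stays)           # sample standard deviation

print(mean_stay, median_stay, mode_stay, range_stay)
print(round(variance_stay, 2), round(sd_stay, 2))
```

`statistics.variance` and `statistics.stdev` treat the data as a sample; `statistics.pvariance` and `statistics.pstdev` are the population versions.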
Examples
Descriptive: "In 2011, there were 34 deaths from the avian flu." (Source: WHO)
Inferential: "In 2025, the world population is predicted to be 8 billion people." (Source: UN)
Inferential: "Based on a sample of 2739 respondents, it is estimated that all pet owners spent a total of 14 billion dollars on veterinarian care for their pets." (Source: American Pet Products Association, Pet Owners Survey)
Descriptive: "The median cost of 2739 respondents on veterinarian care for their pets is 12.5 billion dollars."
Inferential: "Scientists at the University of Oxford in England found that a good laugh significantly raises a person’s pain level tolerance."
Inferential: "Drinking decaffeinated coffee can raise cholesterol levels by 7%" (Source: American Heart Association)
Descriptive: "The average stay in a hospital for 2000 patients who had circulatory system problems was 4.7 days."
Descriptive: "In a poll of 3036 adults, 32% said that they got a flu shot at a retail clinic" (Source: Harris Interactive Poll)
Descriptive: "In the Philippines, there were 894 COVID cases reported on Sep 4-10, 15.0% higher than the 780 cases reported on Aug 28-Sep 3." (Source: WHO)
Inferential: "It is projected that the average total number of adults and children receiving ARV treatment will be 31.7 million people by 2024." (Source: aidsdatahub.org)
Classification of Statistics
Parametric statistics: Assumes a random sample from a normally distributed population and involves testing hypotheses about the population mean.
Assumptions include normality, homoscedasticity (equal variances), interval or ratio level measurement, and absence of outliers.
Nonparametric statistics: Does not assume any underlying data distribution and involves hypothesis testing about a population median.
Less powerful than parametric tests, requiring slightly larger sample sizes.
Parametric vs. Non-Parametric Tests
Parametric Tests
One Sample: z test, t test.
Two Sample:
Independent Sample: t test or z test.
Paired Sample: paired t test.
Non-Parametric Tests
One Sample: Chi-Square, K-S, Runs test, Binomial
Two Sample:
Independent Sample: Chi-Square, Mann-Whitney, Median, K-S
Paired Sample: Sign, Wilcoxon, McNemar, Chi-square
Epidemiology
Epidemiology: Study of the distribution and determinants of health-related conditions in human populations, and the application of this study to the control of health problems.
Originally dealt with infectious diseases but now includes chronic diseases and health-related conditions (e.g., heart attacks, car accidents, autism, arthritis).
Distribution: Refers to time, place, and types of persons affected.
Determinants: Physical, biological, social, cultural, and behavioral factors that influence health.
Terminologies in Biostatistics
Variable: A characteristic or attribute that can assume different values (e.g., height, weight, BMI, blood pressure).
Data: Values (measurements or observations) that the variable can assume.
Data set: A collection of data values; each value is a datum or data point.
Population: All subjects (human or otherwise) being studied.
Sample: A group of subjects selected from a population.
Census: The process of collecting data from every subject in the population.
In the Philippines, the Philippine Statistics Authority is responsible for national censuses and surveys.
Parameter: A characteristic or measure obtained by using all values from a specific population.
Example: average income of all nurses in the Philippines
Statistic: A characteristic or measure obtained by using data values from a sample.
Example: average income of a sample of 15 nurses
Measurements
Four levels of measurement (lowest to highest): nominal, ordinal, interval, and ratio.
Nominal: Classifies data into mutually exclusive categories with no order or ranking (e.g., name, gender, zip code, student number, marital status).
Ordinal: Classifies data into categories that can be ranked, but precise differences do not exist (e.g., rankings, performance evaluations).
Interval: Ranks data with precise differences between units, but no meaningful zero exists (e.g., IQ score, temperature in °C or °F).
Ratio: Possesses all characteristics of interval measurement and has a true zero (e.g., height, weight).
Levels of Measurement Table
| Level | Mutually Exclusive Categories | Can be Ordered | Meaningful Differences | Can be Added/Subtracted | Meaningful Zero | Can be Multiplied/Divided |
|---|---|---|---|---|---|---|
| Nominal | ✓ | | | | | |
| Ordinal | ✓ | ✓ | | | | |
| Interval | ✓ | ✓ | ✓ | ✓ | | |
| Ratio | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Examples
Grade description (Passed, Failed, INC, UW) - Nominal (N)
Letter Grades (A, B, C, D, E, F) - Ordinal (O)
General Weighted Average (40.00-100.00) - Interval (I)
Contact Numbers - Nominal (N)
Time of the day in 24H format (00:00:00-23:59:59) - Interval (I)
Age group (infant, children, adolescent, adult) - Ordinal (O)
Age in years (0, 1, 2, 3, …) - Ratio (R)
Time in seconds to finish a task - Ratio (R)
Variables
Experimental Variables
Response variable (dependent variable): Affected by the value of some other variable.
Explanatory variable (independent variable): Thought to affect the values of the response variable.
Examples
Research Question: "Do tomatoes grow fastest under fluorescent, incandescent, or natural light?"
Independent: Type of light the tomato plant is grown under
Dependent: The rate of growth of the tomato plant
Research Question: "What is the effect of intermittent fasting on blood sugar levels?"
Independent: Presence or absence of intermittent fasting
Dependent: Blood sugar levels
Correlation Studies
Predictor variable: Same as the independent variable.
Criterion variable: Same as the dependent variable.
Correlational research does not identify independent and dependent variables, as analysis is not dependent on the direction of the relationship.
Examples
Research Question: "An economist wants to determine whether there is a linear relationship between a country’s gross domestic product (GDP) and carbon dioxide (CO2) emissions."
Predictor: Gross Domestic Product
Criterion: Carbon dioxide (CO2) emissions
Research Question: "A director of alumni affairs at a small college wants to determine whether there is a linear relationship between the number of years alumni have been out of school and their annual contributions (in thousands of dollars)."
Predictor: The number of years out of school
Criterion: The annual contributions (in thousands of dollars)
Qualitative and Quantitative Variables
Qualitative variables: Have distinct categories based on some characteristic or attribute (e.g., gender, religious preference, location).
Quantitative variables: Can be counted (discrete) or measured (continuous) (e.g., age, height, weight, body temperature).
Discrete and Continuous Variables
Discrete variables: Assume values that can be counted (e.g., number of children in a family, number of students in a classroom, number of calls received by the emergency department).
Continuous variables: Assume an infinite number of values between any two specific values and are obtained by measuring (e.g., body temperature, height, weight).
Examples
Grade description (Passed, Failed, INC, UW) - Qualitative (QL)
Letter Grades (A, B, C, D, E, F) - Qualitative (QL)
General Weighted Average (0.00-100.00) - Quantitative Continuous (QTC)
Time of the day in 24H format (00:00:00-23:59:59) - Quantitative Continuous (QTC)
Age group (infant, children, adolescent, adult) - Qualitative (QL)
Age in years (0, 1, 2, 3, …) - Quantitative Discrete (QTD)
Time in seconds to finish a task - Quantitative Continuous (QTC)
Summation Notation
$\sum$: Greek uppercase letter sigma; used to denote a sum
$x_i$: argument; expression to be added
$i = 1$: subscript; lower limit; initial term of $x_i$ in the summation
$n$: upper limit; final term of $x_i$ in the summation
Rules in Summation Notation
Rule 1: The sum of a constant $c$ from 1 to $n$ equals the product of the constant and $n$: $\sum_{i=1}^{n} c = nc$.
Example: Evaluate .
Rule 2: The sum of a variable and a constant equals the sum of the variable plus the product of the constant and $n$: $\sum_{i=1}^{n} (x_i + c) = \sum_{i=1}^{n} x_i + nc$.
Example: Let . Evaluate .
Rule 2: The sum of the difference of a variable and a constant equals the sum of the variable minus the product of the constant and $n$: $\sum_{i=1}^{n} (x_i - c) = \sum_{i=1}^{n} x_i - nc$.
Example: Let . Evaluate .
Rule 3: The sum of the product of a variable and a constant equals the product of the constant and the sum of the variable: $\sum_{i=1}^{n} c\,x_i = c \sum_{i=1}^{n} x_i$.
Example: Let . Evaluate .
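The summation rules can be checked numerically. The sketch below uses hypothetical values for the variable and the constant (they are not the values from the examples above) and verifies each rule by computing both sides.

```python
# Hypothetical data for illustration: x_1..x_5 and a constant c.
x = [2, 4, 6, 8, 10]
c = 3
n = len(x)

# Rule 1: the sum of a constant over n terms equals n * c.
rule1_lhs = sum(c for _ in range(n))
rule1_rhs = n * c

# Rule 2 (sum): sum of (x_i + c) equals sum of x_i plus n * c.
rule2_plus_lhs = sum(xi + c for xi in x)
rule2_plus_rhs = sum(x) + n * c

# Rule 2 (difference): sum of (x_i - c) equals sum of x_i minus n * c.
rule2_minus_lhs = sum(xi - c for xi in x)
rule2_minus_rhs = sum(x) - n * c

# Rule 3: sum of (c * x_i) equals c times the sum of x_i.
rule3_lhs = sum(c * xi for xi in x)
rule3_rhs = c * sum(x)

print(rule1_lhs, rule2_plus_lhs, rule2_minus_lhs, rule3_lhs)  # 15 45 15 90
```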
Classifications of Data
Internal data: Relates to activities within the organization collecting the data.
Example: health indicator data for the Department of Health.
External data: Relates to activities outside the organization collecting the data.
Example: data obtained from computerized databases, books, periodicals, and government documents.
Statistical data: Published data from government institutions, companies, and associations involving figures, tables, and graphs.
Nonstatistical data: Information that does not involve figures, tables, and graphs.
Sources of Data
Observational data: From cross-sectional, retrospective, and prospective studies; collected from naturally occurring situations or for administrative purposes (e.g., medical records, government agencies, surveys).
Experimental data: Derived from planned experiments and clinical trials (e.g., survival data, recovery rates, relapse rates).
Types of Studies
Cross-Sectional Studies: Data refer to a single point in time (the present).
Example: Surveys about present health characteristics of a population.
Retrospective Studies: Focus on risk factors or exposure factors in the past.
Example: Case-control studies, where patients with a disease are asked about prior exposures.
Prospective Studies: Follow subjects from the present into the future.
Example: Prospective cohort study, follows individuals who are free from disease but have an exposure factor.
Experimental Studies and Quality Control: Includes a study group, a control group, an independent (causal) variable, and a dependent (outcome) variable. Subjects are assigned randomly.
Clinical Trials: Experiment performed by a healthcare organization to evaluate the effect of an intervention or treatment against a control in a clinical environment.
Methods of Data Collection
Data collection: Gathering and measuring information on variables of interest in an established systematic fashion.
Methods
Indirect or Questionnaire method: Gathering information via a series of questions.
Types of questions:
Closed-ended: Respondent picks from given options (Dichotomous, Nominal-polytomous, Ordinal-polytomous, Continuous)
Open-ended: Respondent formulates own answer.
Direct or Interview method: Asking questions and getting answers from participants.
Registration method: Governed by laws (e.g., birth/death certificate registration).
Observation method: Data pertaining to behaviors of individuals or groups.
Experiment method: Determining cause-and-effect relationships under controlled conditions.
Examples
A PhilHealth region VIA supervisor would like to monitor the work behavior of his subordinates using a hidden camera. - Observation method
A researcher stood outside a vape shop to ask the users about their views on the health effects of vaping. - Interview method
A kinesiologist tested which of three exercise programs (push-pull-leg split, upper-lower, or total body), each assigned to a different group of people, is the most effective for hypertrophy. - Experiment method
A health educator would like to study the opinions of the 237 nursing students in Cavite on the new BS Nursing curriculum. - Questionnaire method
The DOH issues and records COVID-19 vaccination certificates for vaccinated people. - Registration method
SAMPLING METHOD
A. Definition of Terms
Population: The entire group of individuals of interest in a study.
Target Population: The population from which representative information is desired and to which inferences will be made.
Study Population: A subset of the target population that can be studied.
Example: If studying pregnant women in Cavite, the target population is all pregnant women in Cavite, while the study population might be pregnant women attending maternity clinics in Cavite.
Sampling Frame: A roster of the sampled population.
Elementary Units: Individuals in the population on which a measurement is taken.
Sampling Units: Units chosen in selecting a sample (e.g., households, schools).
B. Criteria of Sampling Design
Representative of the Population: The sample must properly represent the population.
Reliability: It should be possible to measure the reliability of estimates made from the sample.
Practicable: The sampling design must be simple and straightforward to carry out.
Efficient and Economical: The design should produce the most information at the smallest cost.
C. Methods of Probability Sampling
Census: A count or measure of an entire population.
Sampling: Studying only a portion of the population.
Sample: A representative portion of the population.
Advantages of Sampling
Reduced Cost
Greater Speed
Greater Scope
Greater Accuracy: Less prone to non-sampling errors due to careful supervision and processing.
Probability vs. Non-Probability Sampling
Probability Sampling: Uses random selection, where each unit in the population has a known, nonzero chance of being included (an equal chance under simple random sampling).
Non-Probability Sampling: Does not involve random selection.
Probability Sampling Methods
Simple Random Sampling:
Selecting a sample of size n from the population using a lottery, a table of random numbers, or statistical software.
Example: Randomly selecting 15 students from a class of 45.
Systematic Random Sampling:
Selecting every kth individual from an ordered population, with the first individual selected at random.
Formula: k = N/n (where k = sampling interval, N = population size, n = sample size)
Example: Selecting every 40th employee from a list of 1200 employees to get a sample size of 30.
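The two examples above (15 students out of 45, and every 40th employee out of 1200) can be sketched with Python's `random` module. The function names and the fixed seed are illustrative assumptions; the units are represented simply by their list numbers.

```python
import random

def simple_random_sample(population, n, rng):
    """Draw n distinct units; every subset of size n is equally likely."""
    return rng.sample(population, n)

def systematic_sample(population, n, rng):
    """Take every k-th unit after a random start, with k = N // n."""
    k = len(population) // n       # sampling interval
    start = rng.randrange(k)       # random start among the first k units
    return population[start::k][:n]

rng = random.Random(2024)          # fixed seed so the sketch is reproducible

students = list(range(1, 46))      # class of 45 students, numbered 1-45
employees = list(range(1, 1201))   # list of 1200 employees, numbered 1-1200

srs = simple_random_sample(students, 15, rng)
sys_sample = systematic_sample(employees, 30, rng)

print(sorted(srs))
print(sys_sample[:5])              # consecutive picks are 40 apart
```

The systematic version assumes N is a multiple of n; when it is not, the interval k is usually rounded and the last units in the list may have a slightly different chance of selection.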
Stratified Random Sampling:
Dividing the population into non-overlapping strata (homogeneous within, heterogeneous between).
Types:
Equal Allocation: Equal sample sizes among the strata.
Formula: no. of samples per stratum = required sample size / number of strata
Example: With 40 freshmen, 30 sophomores, 30 juniors, and 20 seniors, and a desired sample size of 40, using equal allocation, 10 samples are taken from each year level.
Proportional Allocation: Sample sizes proportional to the strata sizes.
Formula: no. of samples per stratum = (stratum size / total population size) * required sample size
Example: Freshmen (40), Sophomores (30), Juniors (30), Seniors (20), Total 120. Required sample size = 40. Freshman sample size = (40/120)*40 = 13.33, rounded to 13.
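Both allocation formulas can be applied to every year level at once. The following is a minimal sketch using the stratum sizes from the examples above; the function names are illustrative.

```python
def equal_allocation(strata_sizes, n):
    """Give every stratum the same share: n / number of strata."""
    per_stratum = n // len(strata_sizes)
    return {name: per_stratum for name in strata_sizes}

def proportional_allocation(strata_sizes, n):
    """Give each stratum (stratum size / population size) * n, rounded."""
    total = sum(strata_sizes.values())
    return {name: round(size / total * n) for name, size in strata_sizes.items()}

year_levels = {"freshmen": 40, "sophomores": 30, "juniors": 30, "seniors": 20}

equal = equal_allocation(year_levels, 40)
proportional = proportional_allocation(year_levels, 40)

print(equal)         # 10 from each year level
print(proportional)  # freshmen 13, sophomores 10, juniors 10, seniors 7
```

Here the rounded proportional shares add up to exactly 40. With other stratum sizes, rounding can leave the total one or two units off the required sample size, and a stratum's share is then adjusted by hand.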
Cluster Random Sampling:
Dividing the population into clusters (preferably heterogeneous within each cluster) and randomly selecting clusters.
Types:
One-Stage: All individuals in selected clusters are taken as samples.
Two-Stage: Random samples are taken from each selected cluster.
Multistage Sampling:
Sampling in stages, drawing samples within samples.
Example: Randomly selecting provinces, then municipalities within provinces, then barangays within municipalities, and finally households within barangays.
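A two-stage cluster sample can be sketched as two rounds of random selection: first clusters, then units within each selected cluster. The barangay and household labels below are hypothetical placeholders, and the function name is an assumption made for illustration.

```python
import random

def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, rng):
    """Stage 1: randomly pick clusters. Stage 2: randomly pick units in each."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], n_per_cluster) for name in chosen}

rng = random.Random(7)  # fixed seed so the sketch is reproducible

# Hypothetical frame: 10 barangays with 20 households each.
clusters = {
    f"barangay_{b}": [f"b{b}_household_{h}" for h in range(1, 21)]
    for b in range(1, 11)
}

sample = two_stage_cluster_sample(clusters, n_clusters=3, n_per_cluster=5, rng=rng)
for barangay, households in sample.items():
    print(barangay, households)
```

Taking every household in the chosen barangays instead of sampling five would turn this into one-stage cluster sampling; adding further levels (provinces, then municipalities) gives multistage sampling.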
D. Non-Probability Sampling Methods
Convenience Sampling: Selecting readily available individuals.
Example: Interviewing vaccinated patients outside an animal bite center.
Purposive Sampling: Selecting based on judgment and prior information.
Modal Instance Sampling: Also known as homogeneous sampling. It aims to get a sample of people who have similar or identical traits.
Example: Studying Filipino nurses in Saudi Arabia.
Heterogeneous Sampling: Also known as maximum variation sampling.
Selecting candidates across a broad spectrum relating to the topic of study.
Example: Selecting Filipinos from different faith backgrounds to get the perspectives of religious Filipinos on pro-life advocacies
Expert Sampling: Selecting based on demonstrable expertise.
Example: Interviewing experts in Ayurvedic and Filipino Folk Medicine.
Snowball Sampling: Used when the population is hard to reach; participants recruit other participants.
Example: Researching recreational marijuana users.
Quota Sampling: Non-random selection of a predetermined number or proportion of units.
Types:
Proportional: Quota reflects population proportions. Example: If a company has 600 drivers and 400 train-riders, a sample of 100 should have 60 drivers and 40 train-riders.
Non-Proportional: Quota is determined in advance without regard to population proportions. Example: Drawing a sample of 100 people with a quota of 50 people under 40 and a quota of 50 people over 40.
Glossary
Population: The entire group of individuals of interest.
Sample: A subset of the population used for study.
Strata: Non-overlapping subgroups within a population (used in stratified sampling).
Cluster: A grouping of population elements (used in cluster sampling).
Sampling Frame: A list of all elements in the population from which the sample is drawn.
Random Selection: Selection process where each element has an equal chance of being chosen.
SAMPLE SIZE DETERMINATION
A. Sample Size
Definition: The number of observations in a sample (n), crucial for accurate standard error estimation. N represents population size.
Importance: Estimates the number of subjects needed to detect an association, considering Type I (false positive) and Type II (false negative) errors.
Critical Components (Miaoulis & Michener, 1976):
Level of precision
Level of confidence or risk
Degree of variability in attributes measured
Level of Statistical Precision/Sampling Error
Definition: Closeness between calculated and population values. Estimated by standard error.
Descriptive Estimation: Difference between sample estimate and population parameters.
Formula:
Inferential Estimation: Used to estimate the significant difference between or among parameter estimates.
Formula:
Confidence / Risk Level
Definition: The degree to which an assumption or number is likely to be true.
Interpretation: Statistical measure of how often results fall within a specified range (e.g., 95% confidence means 95 out of 100 times).
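Each confidence level corresponds to a critical Z value, which is what the sample size formulas later in these notes use. A minimal sketch with `statistics.NormalDist` from the standard library (the helper name `z_critical` is illustrative):

```python
from statistics import NormalDist

def z_critical(confidence):
    """Two-sided critical Z value for a given confidence level."""
    alpha = 1 - confidence                      # total risk split over two tails
    return NormalDist().inv_cdf(1 - alpha / 2)  # standard normal quantile

for level in (0.90, 0.95, 0.99):
    print(level, round(z_critical(level), 3))
```

The 90% value is 1.645, which these notes round to 1.65, and the 95% value rounds to 1.96.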
Degree of Variability
Impact:
Heterogeneous population: Requires larger sample size.
Homogeneous population: Smaller sample size is sufficient.
B. Methods of Determining Sample Size
Four Methods (Cochran, Gupta & Kapoor, Israel et al.):
Census method (small populations)
Replicate sample size from similar studies
Sample size from published tables
Applying formulas to calculate sample size
1. Census Method
Application: Suitable for very small populations.
Advantage: High accuracy and precision.
Disadvantage: Costly for large populations.
2. Replicate a Sample Size
Application: When research is in a similar field with available literature.
Disadvantage: Potential to repeat errors from previous studies.
3. Sample Size from Published Tables
Method: Using pre-defined tables based on specified criteria.
4. Applying Formulas
Advantage: Customizable based on research type and precision.
Common Formulas:
Estimating the mean or average
Estimating proportion (Infinite population)
Yamane (Simplified form of Proportions for finite population)
Infinite Population Correction
Finite Population Correction
1. Estimating the Mean or Average
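Formula: n = (Zσ / E)<sup>2</sup> (Z = critical value for the confidence level, σ = population standard deviation, E = margin of error)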
Estimating Sigma (σ) when unknown: Conduct preliminary survey or use results from previous studies.
Note: Sample size should be at least 30.
Example: A health care professional wishes to estimate the birth weights of infants. How large a sample must be obtained if she desires to be 90% confident that the true mean is within 2 ounces of the sample mean? Assume that the population standard deviation is 8 ounces.
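Solution: Given: Z = 1.65 (90% confidence level), E = 2, and σ = 8
Formula: n = (Zσ / E)<sup>2</sup>
Solution: n = ((1.65)(8) / 2)<sup>2</sup> = (6.6)<sup>2</sup> = 43.56, rounded up to 44
Answer: A sample of 44 infants must be obtained.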
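The birth-weight calculation can be reproduced in a few lines. This is a sketch of n = (Zσ / E)<sup>2</sup> with the result rounded up to the next whole subject; the function name is illustrative.

```python
from math import ceil

def sample_size_for_mean(z, sigma, e):
    """n = (z * sigma / e)^2, rounded up to the next whole subject."""
    return ceil((z * sigma / e) ** 2)

# Values from the birth-weight example: Z = 1.65, sigma = 8 oz, E = 2 oz.
n_infants = sample_size_for_mean(z=1.65, sigma=8, e=2)
print(n_infants)  # 44
```

The result is always rounded up, never to the nearest integer, so that the achieved margin of error does not exceed E.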
2. Estimating Proportion (Infinite Population)
Formula: n = Z<sup>2</sup>pq / E<sup>2</sup> (p = estimated proportion, q = 1 - p, E = margin of error)
If p is unknown: Use p = 0.5 for maximum sample size.
Example: A federal report indicated that 27% of children ages 2 to 5 years had a good diet. How large a sample is needed to estimate the true proportion of children with good diets within 2% with 95% confidence?
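Solution: Given: Z = 1.96 (95% confidence level), E = 0.02, p = 0.27, and q = 1 - 0.27 = 0.73
Formula: n = Z<sup>2</sup>pq / E<sup>2</sup>
Solution: n = (1.96)<sup>2</sup>(0.27)(0.73) / (0.02)<sup>2</sup> = 1892.95, rounded up to 1,893
Answer: A sample of 1,893 children is needed.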
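The same calculation in code, as a sketch of n = Z<sup>2</sup>pq / E<sup>2</sup>. The function name is illustrative, and the default p = 0.5 follows the rule above for an unknown proportion.

```python
from math import ceil

def sample_size_for_proportion(z, e, p=0.5):
    """n = z^2 * p * q / e^2 with q = 1 - p, rounded up."""
    q = 1 - p
    return ceil(z ** 2 * p * q / e ** 2)

# Values from the diet example: Z = 1.96, E = 0.02, p = 0.27.
n_children = sample_size_for_proportion(z=1.96, e=0.02, p=0.27)
print(n_children)  # 1893

# With p unknown, p = 0.5 gives the largest (most conservative) sample size.
n_conservative = sample_size_for_proportion(z=1.96, e=0.02)
print(n_conservative)
```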
3. Yamane's Formula (Finite Population)
Formula: n = N / (1 + Ne<sup>2</sup>) (N = population size, e = level of precision)
Example: A researcher plans to conduct a survey about the food preferences of BS Nursing students. If the population of students is 500, find the sample size if the margin of error is 5%.
Solution: Given: N = 500 and e = 0.05
Formula: n = N / (1 + Ne<sup>2</sup>)
Solution: n = 500 / (1 + 500(0.05)<sup>2</sup>) = 500 / 2.25 = 222.22..., rounded up to 223
Answer: The researcher needs to get 223 BS Nursing students for his study.
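Yamane's formula is a one-liner in code. A minimal sketch, with an illustrative function name, applied to the nursing-student example:

```python
from math import ceil

def yamane_sample_size(population_size, e):
    """n = N / (1 + N * e^2), rounded up to the next whole respondent."""
    return ceil(population_size / (1 + population_size * e ** 2))

n_students = yamane_sample_size(population_size=500, e=0.05)
print(n_students)  # 223
```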
4. Infinite Population Correction
Formula:
5. Finite Population Correction
Formula: .
C. Power Analysis
Purpose: Determines the minimum sample size needed.
Factors: Power, effect size, significance level.
Power (1-β): Ability to correctly reject a false null hypothesis. Adequate level: 80% or higher.
Effect Size: Magnitude of the effect of independent variables on the dependent variable. Conventional values for Cohen's f<sup>2</sup>, the effect size used in multiple regression:
Small: 0.02
Medium: 0.15
Large: 0.35 (Cohen, 1988)
Significance Level (α): Probability of rejecting the null hypothesis when it is true.
Social and Behavioral Science: Generally 0.05 (5%)
Medical Research: Generally 0.01 (1%)
Confidence Level = 1 - α
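To show how power, effect size, and significance level jointly determine the minimum sample size, here is a rough sketch for one specific case: comparing two independent means with a two-sided test. It uses the normal approximation n per group = 2((z<sub>1-α/2</sub> + z<sub>1-β</sub>) / d)<sup>2</sup>, where d is Cohen's d (a different effect size measure from the f<sup>2</sup> values listed above). This approximation is an assumption of the sketch; dedicated software typically uses the t distribution and returns slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist

def n_per_group_two_means(effect_size, alpha=0.05, power=0.80):
    """Approximate a priori sample size per group for comparing two
    independent means (two-sided test, normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = NormalDist().inv_cdf(power)          # quantile matching 1 - beta
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

n_05 = n_per_group_two_means(0.5)              # medium d, alpha = 0.05
n_01 = n_per_group_two_means(0.5, alpha=0.01)  # stricter alpha needs more subjects
print(n_05, n_01)
```

Lowering α, raising the desired power, or expecting a smaller effect all increase the required sample size.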
D. G*Power Calculator
Description: Software for calculating sample size and power for various statistical tests.
Steps:
Establish research goals and hypotheses.
Choose appropriate statistical tests.
Distribution-based approach (exact, F, t, χ<sup>2</sup>, z tests)
Design-based approach
Choose a power analysis method.
Input required variables and calculate.
Power Analysis Methods in G*Power
| Type of Power Analysis | Input | Output |
| :--- | :--- | :--- |
| A Priori | Effect Size, α, Power | Sample Size (N) |
| Compromise | Effect Size, Sample Size (N), Error Ratio (β/α) | α and β |
| Criterion | Effect Size, Power, Sample Size (N) | α |
| Sensitivity | α, Power, Sample Size (N) | Effect Size |
| Post-hoc | Effect Size, α, Sample Size (N) | Power (1-β) |
A Priori Analysis
Purpose: Sample size calculation performed before the study. Used to determine the sample size N.
Post-Hoc Analysis
Purpose: Conducted after study completion to calculate power level (1-β).
Limitation: Only controls α, not β.
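A post-hoc analysis runs the a priori calculation in reverse: the sample size is fixed and the achieved power is the output. The sketch below does this for a two-sided comparison of two independent means under a normal approximation, with Cohen's d as the effect size. Both the test choice and the approximation are assumptions made for illustration, not the exact computation performed by G*Power.

```python
from math import sqrt
from statistics import NormalDist

def post_hoc_power(effect_size, n_per_group, alpha=0.05):
    """Approximate achieved power (1 - beta) for comparing two independent
    means with n_per_group subjects in each group (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    shift = effect_size * sqrt(n_per_group / 2)    # expected test statistic
    # The negligible chance of rejecting in the opposite tail is ignored.
    return NormalDist().cdf(shift - z_alpha)

print(round(post_hoc_power(0.5, 63), 3))  # just above the 0.80 threshold
print(round(post_hoc_power(0.5, 30), 3))  # an underpowered study
```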
E. G*Power Calculator for Survey Research
Uses three models: Simple, Mediation, and Moderation.
Steps for G*Power:
Choose "F tests" from the test family.
Select "Linear multiple regression: fixed model, R² deviation from zero".
Set power analysis to "A priori".
Specify effect size (0.15 is medium effect), α (0.05), and power (0.80). These are common settings for social and business science research.
Enter the number of predictors (maximum arrows pointing to a dependent variable).
Click Calculate.
Example (Simple Model): Three predictors require a minimum sample size of 77.
Example (Mediation Model): Four predictors require a minimum sample size of 85.
Example (Moderation Model): Seven predictors require a minimum sample size of 103.
F. G*Power Calculator for Experimental Research
Choose "t tests" or "F tests" in the test family.
Select the appropriate statistical test (e.g., "Means: Difference between two dependent means (matched pairs)" for paired samples t-test).
Input effect size (e.g., 0.5 for medium effect), power, and alpha levels.
Use alpha = 0.01 as a safe bet, unless previous research indicates that a lower value is needed.
Glossary
Sample Size (n): The number of observations included in a sample.
Population Size (N): The total number of individuals in a population.
Standard Error: A measure of the statistical accuracy of an estimate.
Type I Error (False Positive): Rejecting a true null hypothesis.
Type II Error (False Negative): Failing to reject a false null hypothesis.
Level of Precision (E): The acceptable margin of error in an estimate.
Level of Confidence: The probability that an estimate falls within a certain range.
Degree of Variability: The extent to which data points in a population differ from each other.
Power (1-β): The probability of correctly rejecting a false null hypothesis.
Effect Size: The magnitude of the effect of an independent variable on a dependent variable.
Significance Level (α): The probability of rejecting the null hypothesis when it is true.