Intro to Biostatistics
Central Tendency
Purpose: Describe the typical subject in a dataset by summarizing a quantitative variable (one column of data) with a single number.
Three key measures: Mean, Median, Mode.
Relevance: Helps describe the typical breast cancer patient (age, socioeconomic status, general health) in public health datasets.
The Mean
Definition: The average of a data set.
Formula:
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
Example:
Data: 7, 4, 4, 5
Calculation: ( (7+4+4+5)/4 = 5 )
R:
x <- c(7,4,4,5)
mean(x)
# [1] 5
Robustness: The mean is sensitive to extreme values (not robust).
Example of non-robustness:
Replace 5 with 50: data = 7, 4, 4, 50
R:
y <- c(7,4,4,50)
mean(y)
# [1] 16.25
Interpretation: One inflated value greatly increases the mean.
Practical takeaway: In skewed distributions or when outliers are present, the mean may misrepresent the typical value.
The Median
Definition: The midpoint value; half of observations are below, half above.
If n is even: median is the mean of the two center values; if n is odd: median is the center observation.
Examples:
x <- c(1,4,3,2); median(x)   # [1] 2.5
y <- c(1,40,3,2); median(y)  # [1] 2.5
Robustness: The median is robust to extreme values.
Practical takeaway: Use the median to describe the typical subject when data are skewed or contain outliers.
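A minimal sketch comparing the two measures side by side, reusing the vectors from the mean example earlier in these notes. Replacing 5 with 50 moves the mean by more than 11 units but the median by only 1.

```r
x <- c(7, 4, 4, 5)    # original data
y <- c(7, 4, 4, 50)   # same data with 5 replaced by 50

mean(x)     # [1] 5
mean(y)     # [1] 16.25
median(x)   # [1] 4.5
median(y)   # [1] 5.5
```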
The Mode
Definition: The most frequently occurring value.
Notes:
Not always useful for continuous data without grouping; often used with categorical data.
Base R has no built-in function for the statistical mode (mode() returns an object's storage type instead); the DescTools package provides Mode().
R usage:
install.packages("DescTools")
library(DescTools)
x <- c(7,4,4,5)
Mode(x)
# [1] 4
# attr(,"freq")
# [1] 2
Interpretation: Four occurs twice.
Example with categorical data:
Sex <- c("M","F","M","F","F")
Mode(Sex)
# [1] "F"
# attr(,"freq")
# [1] 3
table(Sex)
# Sex
# F M
# 3 2
Practical takeaway: Mode can be informative for categorical data and for identifying the most common category.
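If installing DescTools is not an option, the most frequent value can also be found with table(). The sketch below is one possible approach, not the course's method; stat_mode is a hypothetical helper name. It returns the value as text (because names() returns character strings) and returns every tied value when there is more than one mode.

```r
# stat_mode is a hypothetical helper, not a base R or DescTools function.
stat_mode <- function(v) {
  counts <- table(v)                      # frequency of each distinct value
  names(counts)[counts == max(counts)]    # value(s) with the highest count
}

stat_mode(c(7, 4, 4, 5))                  # [1] "4"
stat_mode(c("M", "F", "M", "F", "F"))     # [1] "F"
```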
Skewness and Mean vs Median
Definitions:
Left-skewed distribution: tail to the left; typically Mean < Median.
Right-skewed distribution: tail to the right; typically Mean > Median.
Symmetric distribution: Mean ≈ Median.
Rule of thumb:
When there is a large discrepancy between mean and median, the data are likely skewed and the median is often a better descriptor of the typical subject.
Summary statement from slide:
Mean and median relationship helps diagnose skewness in a distribution.
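A small illustration of the rule of thumb, using a made-up right-skewed sample (the values are hypothetical, chosen so that one large observation pulls the mean upward):

```r
# Hypothetical right-skewed sample: most values are small, one is large
x <- c(1, 2, 2, 3, 3, 3, 4, 20)

mean(x)               # [1] 4.75
median(x)             # [1] 3
mean(x) > median(x)   # [1] TRUE, consistent with a right skew
```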
Measurements of Variability
Rationale: A measure of central tendency does not capture how spread out the data are. Variability measures provide a fuller picture.
Example to illustrate variability: two patients with systolic blood pressure (SBP) readings can have the same mean but different variability (see slides’ table).
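The slides' table is not reproduced in these notes, so the sketch below uses hypothetical SBP readings to make the same point. It relies on the range and the standard deviation, both defined in the sections that follow.

```r
# Hypothetical SBP readings (mmHg); not the values from the slides' table
patient_a <- c(118, 120, 122)
patient_b <- c(100, 120, 140)

mean(patient_a)                    # [1] 120
mean(patient_b)                    # [1] 120
max(patient_a) - min(patient_a)    # [1] 4
max(patient_b) - min(patient_b)    # [1] 40
sd(patient_a)                      # [1] 2
sd(patient_b)                      # [1] 20
```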
Range (Extent)
Definition: Range = max(x) − min(x).
Robustness: NOT robust to outliers.
R:
x <- c(7,4,4,5)
max(x) - min(x)
# [1] 3
Standard Deviation (and Variance)
Variance definition (sample): s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}
Standard deviation: s = \sqrt{s^2}
Relationship to variance: The standard deviation is the square root of the variance.
Computation notes:
The sample standard deviation is typically used with samples: divide by (n − 1).
In R:
x <- c(7,4,4,5)
y <- c(1,1,1,1)
sd(x) # [1] 1.414214
sd(y) # [1] 0
Importance: Std. dev. is one of the most important dispersion measures, but it is not robust to outliers.
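To connect the formula to the sd() output, the sample variance and standard deviation can be computed step by step. This is a sketch for checking understanding; the intermediate variable names are illustrative.

```r
x <- c(7, 4, 4, 5)
n <- length(x)

deviations <- x - mean(x)            # 2 -1 -1  0
s2 <- sum(deviations^2) / (n - 1)    # sample variance: 6 / 3 = 2
s  <- sqrt(s2)                       # sample standard deviation

s2                       # [1] 2
s                        # [1] 1.414214
all.equal(s2, var(x))    # [1] TRUE
all.equal(s, sd(x))      # [1] TRUE
```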
Quartiles and Interquartile Range (IQR)
Quartiles:
Q1: 25th percentile (25% of data below, 75% above).
Q3: 75th percentile (75% below, 25% above).
Calculation: Multiple methods exist; this course uses R's default method.
IQR:
Definition: IQR = Q3 − Q1
Robustness: The IQR is robust to outliers.
Example (death data):
summary(death) yields:
Min = 0.600, 1st Qu. = 2.300, Median (Q2) = 3.400, Mean = 3.312, 3rd Qu. = 4.200, Max = 6.100
Therefore:
Q1 = 2.300, Q3 = 4.200
IQR = 4.200 − 2.300 = 1.9
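The death data are not included in these notes, so the following sketch uses a hypothetical vector to show how the quartiles and the IQR can be obtained directly with quantile() and IQR(), both of which use R's default method.

```r
# Hypothetical data; the death vector from the slides is not reproduced here
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

q <- quantile(x, probs = c(0.25, 0.75))   # Q1 = 3, Q3 = 7 (R's default method)
q
IQR(x)                                    # [1] 4
```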
The Box Plot and the 5-Number Summary
Five-number summary: Min, Q1, Median (Q2), Q3, Max.
In R, the boxplot presents this summary graphically.
Box plot example (time until death):
Min = 0.6, Q1 = 2.3, Median = 3.4 (Q2), Q3 = 4.2, Max = 6.1
Box plot interpretation:
Box bounds indicate Q1 and Q3; the line inside the box marks the median; whiskers extend to Min and Max when no values are flagged as outliers, and otherwise stop at the most extreme values within 1.5 × IQR of the box.
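The numbers behind a box plot can be inspected without drawing it. The sketch below uses hypothetical data with one extreme value: fivenum() returns the five-number summary, while boxplot.stats() returns the values boxplot() actually draws, with the upper whisker stopping at 9 and the value 30 listed separately. Note that fivenum() reports hinges, which can differ slightly from the quartiles printed by summary().

```r
# Hypothetical data with one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

fivenum(x)                # [1]  1.0  3.0  5.5  8.0 30.0
bp <- boxplot.stats(x)    # the statistics boxplot() uses for drawing
bp$stats                  # [1] 1.0 3.0 5.5 8.0 9.0  (whisker, box edge, median, box edge, whisker)
bp$out                    # [1] 30  (drawn as a separate point)
```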
Outliers in Box Plots
Outlier rule (increased robustness):
Any value < Q1 − 1.5 × IQR or > Q3 + 1.5 × IQR is an outlier.
Outliers are typically shown as circles or asterisks on the box plot.
Important caution: Outliers should be investigated, not automatically deleted; they can be data entry errors, measurement issues, or true extreme values.
Example (conceptual): adding a single value of 10 to the death data set (where Max = 6.1) creates a point well above the original upper limit, Q3 + 1.5 × IQR = 4.2 + 2.85 = 7.05, illustrating why outliers merit scrutiny.
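The 1.5 × IQR rule can be applied by hand as a check. This is a sketch on hypothetical data; the names lower_fence and upper_fence are illustrative, and the quartiles come from quantile() with R's default method.

```r
# Hypothetical data with one suspiciously large value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

q1  <- unname(quantile(x, 0.25))   # 3.25
q3  <- unname(quantile(x, 0.75))   # 7.75
iqr <- q3 - q1                     # 4.5

lower_fence <- q1 - 1.5 * iqr      # -3.5
upper_fence <- q3 + 1.5 * iqr      # 14.5

outliers <- x[x < lower_fence | x > upper_fence]
outliers                           # [1] 30
```

Flagged values like this one are candidates for investigation, not automatic removal.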
Data Visualization
Numeric data: histograms bin numeric values to show frequency within each class.
Command example: hist(death, main = "Time until Death")
Histograms illustrate data distribution shape and potential skewness.
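To see what the binning does, hist() can be asked for the class counts without drawing the plot (plot = FALSE). The data and the breaks below are hypothetical; by default each class includes its right endpoint, so the value 2 falls in the class from 0 to 2.

```r
# Hypothetical data; breaks chosen by hand so the classes are easy to read
x <- c(1, 2, 2, 3, 3, 3, 4, 9)

h <- hist(x, breaks = c(0, 2, 4, 6, 8, 10), plot = FALSE)
h$counts   # [1] 3 4 0 0 1
```

The long empty stretch before the last class is the kind of pattern that suggests a right skew or an outlier.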
Categorical data: bar plots summarize distribution across categories.
Example:
sex <- c("M","M","F","F","F","M","F","F","F")
mytable <- table(sex)
barplot(mytable, main = "Gender Distribution")
Note: For more sophisticated visuals, ggplot2 can be used, but the course demonstrates basic functions.
Week 2 Homework and Activities
Homework overview:
Under Additional Resources in Week 2, Biostatistics Practice Quiz #1 dataset is available.
The Practice Quiz #1 can be taken in Canvas Week 3 Module with unlimited attempts and with answers provided.
Questions 12, 13, 14, 15, 16, and 20 cover material to be discussed in Week 3 AM; review and think about answers.
At the end of Week 3 AM, discuss Practice Quiz problems in class if desired.
The Week 4 PM session will include the actual quiz, consisting of 20 multiple-choice questions focusing on the same concepts; you will need to use R for some questions.
R Lab & HDR Lab Reminders
BRFSS23 health condition: select and read BRFSS23 into R; practice filtering by geographic area (use _MMSA). Familiarize yourself with dataset variables and their types.
First R lab session scheduled for next week (Wednesday PM) with an R Lab Assignment and HDR work.
R Markdown: Short Practice Activity
Task flow:
1) Download Framingham Dataset from Canvas Week 2 resources.
2) Determine Mean, Median, and Std. Dev of BMI.
3) Assess if BMI is skewed.
4) Determine if BMI has outliers.
5) Create a histogram of BMI.
6) Create a Bar Plot of the variable "diabetes" and compute what percent of patients have diabetes.
R Markdown: Short Activity Details
Activity 1: Import Framingham Dataset; examine for misread variables; convert types as needed (e.g., cigsPerDay should be numeric).
Note: Entries that cannot be converted to numbers (e.g., text placeholders for missing data) are coerced to NA.
Activity 2: Compute Mean, Median, and Std. Dev of BMI with na.rm = TRUE.
Sample results (from the provided outputs):
Mean(BMI) ≈ 25.8008
Median(BMI) ≈ 25.4
SD(BMI) ≈ 4.07984
Activity 3: Is BMI skewed? Interpretation: Mean > Median indicates right skew; larger sample sizes tend to attenuate the influence of outliers on the mean.
Activity 4: Determine if BMI has outliers via a boxplot; interpretation: BMI shows many large values consistent with obesity; boxplot reveals extreme values as outliers.
R: boxplot(framingham$BMI, main = "BMI of Patients", horizontal = TRUE)
Activity 5: Create a histogram of BMI; interpretation: Histogram confirms right skew and potential outliers.
Activity 6: Create a Bar Plot of Diabetes; question: What percent have diabetes?
R:
mytable <- table(framingham$diabetes)
barplot(mytable)
To compute percent: (freq / sum(freq)) * 100 for the diabetes categories.
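The percent calculation can be sketched as follows. The vector below is a hypothetical stand-in for framingham$diabetes, assuming a 0/1 coding (1 = has diabetes); check the coding in the actual dataset before interpreting the result.

```r
# Hypothetical 0/1 indicator standing in for framingham$diabetes
diabetes <- c(0, 0, 1, 0, 0, 0, 1, 0, 0, 0)

freq <- table(diabetes)
pct  <- freq / sum(freq) * 100    # percent in each category
pct                               # 0: 80, 1: 20
prop.table(freq) * 100            # same result with a built-in helper
```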
R Markdown: Short Activity 2 Details
Activity 2: Import the Framingham dataset; fix data types as needed. Example fix for cigsPerDay:
framingham$cigsPerDay <- as.numeric(as.character(framingham$cigsPerDay))  # as.character() guards against factor columns
Note: Entries that cannot be converted to numbers are coerced to NA during the conversion.
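A short sketch of what the conversion does, using a hypothetical text column rather than the real cigsPerDay values. Non-numeric text becomes NA (R also issues a warning, silenced here), and summary functions then need na.rm = TRUE.

```r
# Hypothetical column read in as text, with "unknown" marking a missing entry
cigs_chr <- c("20", "0", "unknown", "15")

cigs_num <- suppressWarnings(as.numeric(cigs_chr))  # non-numeric text becomes NA
cigs_num                        # [1] 20  0 NA 15
mean(cigs_num)                  # [1] NA
mean(cigs_num, na.rm = TRUE)    # [1] 11.66667
```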
R Markdown: Short Activity 3 Details (BMI Statistics)
Code and results:
framingham$BMI <- as.numeric(framingham$BMI)
mean(framingham$BMI, na.rm = TRUE)
median(framingham$BMI, na.rm = TRUE)
sd(framingham$BMI, na.rm = TRUE)
Result example (from provided output): mean ≈ 25.8008, median ≈ 25.4, sd ≈ 4.07984
Interpretation:
BMI is mildly right-skewed since Mean > Median (25.80 vs. 25.4). The two values remain close because, in a large sample, a handful of extreme values has limited influence on the mean.
R Markdown: Short Activity 4–6 (Visualizations and Diabetes)
Activity 4: Boxplot for BMI to assess outliers:
boxplot(framingham$BMI, main = "BMI of Patients", horizontal = TRUE)
Conclusion: BMI has many large and extreme values; outliers are visible as points outside whiskers.
Activity 5: Histogram of BMI:
hist(framingham$BMI, main = "BMI of Patients")
Conclusion: Right-skewed distribution with potential outliers.
Activity 6: Bar Plot for Diabetes:
mytable <- table(framingham$diabetes)
barplot(mytable)
Question: What percent have diabetes? Calculation: (frequency of diabetes) / (total observations) × 100.
Practical Takeaways and Connections
When describing public health datasets, always report both a location (central tendency) and a measure of spread (variability).
For skewed data or data with outliers, rely more on the median and IQR than the mean and standard deviation.
Visualizations (histograms, box plots, bar plots) are essential for understanding distribution shape, skewness, and outliers.
R is used throughout for computing statistics and generating visuals; familiarity with basic functions (mean, median, sd, IQR, boxplot, hist, barplot, table) is important.
Notable Formulas and Key References
Mean: \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
Range: \text{Range} = \max(x_i) - \min(x_i)
Variance (sample): s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}
Standard deviation: s = \sqrt{s^2}
Quartiles and IQR: \text{IQR} = Q_3 - Q_1
Outlier rule: values outside [Q_1 - 1.5 \cdot \text{IQR}, \ Q_3 + 1.5 \cdot \text{IQR}] are considered outliers.
Box plot five-number summary: Min, Q1, Median, Q3, Max.
Skewness interpretation:
Right-skew: Mean > Median; data stretched to the right.
Left-skew: Mean < Median; data stretched to the left.
Symmetric: Mean ≈ Median.
Session Note
Looking ahead: Dr. Michael Swain session in 3420 CCCB to discuss natural history of disease, NLM Encyclopedia, and Zotero for reference management.