Intro to Biostatistics

Central Tendency

  • Purpose: Describe the typical subject in a dataset by summarizing a quantitative variable (one column of the data) with a single number.

  • Three key measures: Mean, Median, Mode.

  • Relevance: Helps describe the typical breast cancer patient (age, socioeconomic status, general health) in public health datasets.

The Mean

  • Definition: The arithmetic average of a data set: the sum of the values divided by the number of values.

  • Formula:
    \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

  • Example:

    • Data: 7, 4, 4, 5

    • Calculation: (7 + 4 + 4 + 5) / 4 = 20 / 4 = 5

    • R:

  x <- c(7,4,4,5)
  mean(x)
  # [1] 5
  • Robustness: The mean is sensitive to extreme values (not robust).

  • Example of non-robustness:

    • Replace 5 with 50: data = 7, 4, 4, 50

    • R:

  y <- c(7,4,4,50)
  mean(y)
  # [1] 16.25
  • Interpretation: One inflated value greatly increases the mean.

  • Practical takeaway: In skewed distributions or when outliers are present, the mean may misrepresent the typical value.

The Median

  • Definition: The middle value of the sorted data; half of the observations are below it and half are above.

  • Sort the values first. If n is odd, the median is the center observation; if n is even, it is the mean of the two center values.

  • Examples:

    • x <- c(1,4,3,2) -> median(x) = 2.5

    • y <- c(1,40,3,2) -> median(y) = 2.5

  • Robustness: The median is robust to extreme values.

  • Practical takeaway: Use the median to describe the typical subject when data are skewed or contain outliers.
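
  • Sketch (not from the slides): the two vectors from the median examples above, run side by side, show how one extreme value moves the mean but leaves the median alone.

```r
# Same two vectors as in the median examples above
x <- c(1, 4, 3, 2)
y <- c(1, 40, 3, 2)   # 4 replaced by an extreme value, 40

mean(x)
# [1] 2.5
mean(y)
# [1] 11.5
median(x)
# [1] 2.5
median(y)             # unchanged by the extreme value
# [1] 2.5
```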

The Mode

  • Definition: The most frequently occurring value.

  • Notes:

    • Not always useful for continuous data without grouping; often used with categorical data.

    • Base R has no built-in function for the statistical mode (mode() returns an object's storage type); the DescTools package provides Mode().

  • R usage:

  install.packages("DescTools")  # run once to install the package
  library(DescTools)
  x <- c(7,4,4,5)
  Mode(x)
  # [1] 4
  # attr(,"freq")
  # [1] 2
  • Interpretation: The value 4 occurs twice.

  • Example with categorical data:

  Sex <- c("M","F","M","F","F")
  Mode(Sex)
  # [1] "F"
  # attr(,"freq")
  # [1] 3
  table(Sex)
  # Sex
  # F M
  # 3 2
  • Practical takeaway: Mode can be informative for categorical data and for identifying the most common category.
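
  • Sketch (an alternative, not the course's method): if installing a package is not an option, the mode can be read off a frequency table in base R. The names counts and modes are made up for this illustration. Note that the result is returned as text and that ties return every tied value.

```r
x <- c(7, 4, 4, 5)
counts <- table(x)                            # frequency of each distinct value
modes <- names(counts)[counts == max(counts)] # value(s) with the highest count
modes
# [1] "4"
as.numeric(modes)                             # convert back to a number if needed
# [1] 4
```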

Skewness and Mean vs Median

  • Definitions:

    • Left-skewed distribution: tail to the left; typically Mean < Median.

    • Right-skewed distribution: tail to the right; typically Mean > Median.

    • Symmetric distribution: Mean ≈ Median.

  • Rule of thumb:

    • When the mean and median differ substantially, the data are likely skewed, and the median is often a better descriptor of the typical subject.

  • Summary statement from slide:

    • Mean and median relationship helps diagnose skewness in a distribution.
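
  • Sketch (hypothetical values, not course data): a small right-skewed vector, where a few large values pull the mean above the median.

```r
# Hypothetical right-skewed values: most are small, two are large
x_right <- c(1, 2, 2, 3, 3, 3, 4, 10, 25)

mean(x_right)
# [1] 5.888889
median(x_right)
# [1] 3
mean(x_right) > median(x_right)   # consistent with a right skew
# [1] TRUE
```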

Measurements of Variability

  • Rationale: A measure of central tendency does not capture how spread out the data are. Variability measures provide a fuller picture.

  • Example to illustrate variability: two patients with SBP readings have the same mean but different variability (see slides’ table).
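
  • Sketch (the slides' table is not reproduced here, so these SBP readings are invented for illustration): two patients with the same mean but very different spread.

```r
# Hypothetical systolic blood pressure readings (mmHg)
patient_a <- c(118, 120, 122)
patient_b <- c(100, 120, 140)

mean(patient_a)
# [1] 120
mean(patient_b)
# [1] 120
sd(patient_a)     # readings cluster tightly around the mean
# [1] 2
sd(patient_b)     # readings vary widely around the same mean
# [1] 20
```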

Range (Extent)

  • Definition: Range = max(x) − min(x).

  • Robustness: NOT robust to outliers.

  • R:

  x <- c(7,4,4,5)
  max(x) - min(x)
  # [1] 3

Standard Deviation (and Variance)

  • Variance definition (sample): s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}

  • Standard deviation: s = \sqrt{s^2}

  • Relationship to variance: The standard deviation is the square root of the variance.

  • Computation notes:

    • The sample standard deviation is typically used with samples: divide by (n − 1).

    • In R:

  x <- c(7,4,4,5)
  y <- c(1,1,1,1)
  sd(x)  # [1] 1.414214
  sd(y)  # [1] 0
  • Importance: Std. dev. is one of the most important dispersion measures, but it is not robust to outliers.
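
  • Sketch: the sample variance formula applied step by step to the same x, then checked against R's var() and sd(). The intermediate names dev_sq and s2 are made up for this illustration.

```r
x <- c(7, 4, 4, 5)
n <- length(x)

dev_sq <- (x - mean(x))^2      # squared deviations from the mean: 4 1 1 0
s2 <- sum(dev_sq) / (n - 1)    # sample variance: divide by n - 1
s2
# [1] 2
sqrt(s2)                       # sample standard deviation
# [1] 1.414214

all.equal(s2, var(x))          # matches the built-in variance
# [1] TRUE
all.equal(sqrt(s2), sd(x))     # matches the built-in standard deviation
# [1] TRUE
```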

Quartiles and Interquartile Range (IQR)

  • Quartiles:

    • Q1: 25th percentile (25% of data below, 75% above).

    • Q3: 75th percentile (75% below, 25% above).

  • Calculation: Multiple methods exist; this course uses R's default method.

  • IQR:

    • Definition: IQR = Q3 − Q1

    • Robustness: The IQR is robust to outliers.

  • Example (death data):

    • summary(death) yields:

    • Min = 0.600, 1st Qu. = 2.300, Median (Q2) = 3.400, Mean = 3.312, 3rd Qu. = 4.200, Max = 6.100

    • Therefore:

    • Q1 = 2.300, Q3 = 4.200

    • IQR = 4.200 − 2.300 = 1.9
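
  • Sketch (the death data are not reproduced in these notes, so the vector below is hypothetical): quartiles and IQR with R's default quantile method.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)           # hypothetical values

q <- quantile(x, probs = c(0.25, 0.75))     # Q1 and Q3, R's default method
q
# 25% 75% 
#   3   7 
unname(q[2] - q[1])                         # IQR computed by hand
# [1] 4
IQR(x)                                      # built-in equivalent
# [1] 4
```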

The Box Plot and the 5-Number Summary

  • Five-number summary: Min, Q1, Median (Q2), Q3, Max.

  • In R, the boxplot presents this summary graphically.

  • Box plot example (time until death):

    • Min = 0.6, Q1 = 2.3, Median = 3.4 (Q2), Q3 = 4.2, Max = 6.1

  • Box plot interpretation:

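    • Box bounds indicate Q1 and Q3; the line inside the box marks the median. When no points are flagged as outliers, the whiskers extend to Min and Max; otherwise each whisker stops at the most extreme observation within 1.5 × IQR of the box, and points beyond it are drawn individually.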
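
  • Sketch (hypothetical values): the five numbers a box plot is built from. fivenum() uses Tukey's hinges, which can differ slightly from quantile()'s Q1 and Q3 on other data; for this vector they agree. Dropping plot = FALSE draws the plot instead of only returning its statistics.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)   # hypothetical values

fivenum(x)                          # Min, lower hinge, Median, upper hinge, Max
# [1] 1 3 5 7 9

b <- boxplot(x, plot = FALSE)       # compute box plot statistics without drawing
as.vector(b$stats)                  # whisker end, box edge, median, box edge, whisker end
# [1] 1 3 5 7 9
length(b$out)                       # number of points flagged as outliers
# [1] 0
```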

Outliers in Box Plots

  • Outlier rule (based on the robust IQR):

    • Any value < Q1 − 1.5 × IQR or > Q3 + 1.5 × IQR is an outlier.

    • Outliers are typically shown as circles or asterisks on the box plot.

  • Important caution: Outliers should be investigated, not automatically deleted; they can be data entry errors, measurement issues, or true extreme values.

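  • Example (conceptual): adding a single value of 10 to the death data (maximum 6.1) introduces a point well above the upper fence computed from the original quartiles, Q3 + 1.5 × IQR = 4.2 + 1.5 × 1.9 = 7.05, illustrating why outliers merit scrutiny.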
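
  • Sketch (hypothetical values, not the death data): applying the 1.5 × IQR rule by hand after one extreme value is added. The names x_new, lower, and upper are made up for this illustration.

```r
x <- c(2, 3, 3, 4, 4, 5, 5, 6)    # hypothetical values
x_new <- c(x, 10)                 # add one extreme value

q <- quantile(x_new, probs = c(0.25, 0.75))   # Q1 = 3, Q3 = 5
iqr <- unname(q[2] - q[1])                    # 2
lower <- unname(q[1]) - 1.5 * iqr             # lower fence: 0
upper <- unname(q[2]) + 1.5 * iqr             # upper fence: 8

x_new[x_new < lower | x_new > upper]          # values outside the fences
# [1] 10
```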

Data Visualization

  • Numeric data: histograms bin numeric values to show frequency within each class.

    • Command example: hist(death, main = "Time until Death")

    • Histograms illustrate data distribution shape and potential skewness.

  • Categorical data: bar plots summarize distribution across categories.

    • Example:

  sex <- c("M","M","F","F","F","M","F","F","F")
  mytable <- table(sex)
  barplot(mytable, main = "Gender Distribution")
  • Note: For more sophisticated visuals, ggplot2 can be used, but the course demonstrates basic functions.

Week 2 Homework and Activities

  • Homework overview:

    • Under Additional Resources in Week 2, Biostatistics Practice Quiz #1 dataset is available.

    • The Practice Quiz #1 can be taken in Canvas Week 3 Module with unlimited attempts and with answers provided.

    • Questions 12, 13, 14, 15, 16, and 20 cover material to be discussed in Week 3 AM; review and think about answers.

    • At the end of Week 3 AM, discuss Practice Quiz problems in class if desired.

    • The Week 4 PM session will include the actual quiz, consisting of 20 multiple-choice questions focusing on the same concepts; you will need to use R for some questions.

R Lab & HDR Lab Reminders

  • BRFSS23 health condition: select and read BRFSS23 into R; practice filtering by geographic area (use _MMSA).

  • Familiarize yourself with dataset variables and their types.

  • First R lab session scheduled for next week (Wednesday PM) with an R Lab Assignment and HDR work.

R Markdown: Short Practice Activity

  • Task flow:
    1) Download Framingham Dataset from Canvas Week 2 resources.
    2) Determine Mean, Median, and Std. Dev of BMI.
    3) Assess if BMI is skewed.
    4) Determine if BMI has outliers.
    5) Create a histogram of BMI.
    6) Create a Bar Plot of the variable "diabetes" and compute what percent of patients have diabetes.

R Markdown: Short Activity Details

  • Activity 1: Import Framingham Dataset; examine for misread variables; convert types as needed (e.g., cigsPerDay should be numeric).

    • Note: Entries that cannot be read as numbers (blank or non-numeric text) become NA when a column is converted to numeric.

  • Activity 2: Compute Mean, Median, and Std. Dev of BMI with na.rm = TRUE.

    • Sample results (from the provided outputs):

    • Mean(BMI) ≈ 25.8008

    • Median(BMI) ≈ 25.4

    • SD(BMI) ≈ 4.07984

  • Activity 3: Is BMI skewed? Interpretation: Mean > Median indicates right skew. The gap is small here because, in a large sample, a few extreme values have limited influence on the mean.

  • Activity 4: Determine if BMI has outliers via a boxplot; interpretation: BMI shows many large values consistent with obesity; boxplot reveals extreme values as outliers.

    • R: boxplot(framingham$BMI, main = "BMI of Patients", horizontal = TRUE)

  • Activity 5: Create a histogram of BMI; interpretation: Histogram confirms right skew and potential outliers.

  • Activity 6: Create a Bar Plot of Diabetes; question: What percent have diabetes?

    • R:

  mytable <- table(framingham$diabetes)
  barplot(mytable)
  • To compute percent: (freq / sum(freq)) * 100 for the diabetes categories.

R Markdown: Short Activity 1 Details

  • Activity 1: Import the Framingham dataset; fix data types as needed. Example fix for cigsPerDay:

  framingham$cigsPerDay <- as.numeric(framingham$cigsPerDay)
  • Note: Entries that cannot be converted to a number become NA automatically, so check for NAs after converting.
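
  • Sketch (the vector raw is a made-up stand-in for a column that was read in as text): what as.numeric() does with entries it cannot parse, and why na.rm = TRUE is needed afterwards. R normally warns about the coercion; the warning is suppressed here only to keep the output short.

```r
raw <- c("20", "unknown", "", "15")          # hypothetical text column
clean <- suppressWarnings(as.numeric(raw))   # unparseable entries become NA
clean
# [1] 20 NA NA 15
sum(is.na(clean))                            # how many values were lost
# [1] 2
mean(clean)                                  # NA propagates by default
# [1] NA
mean(clean, na.rm = TRUE)                    # drop NAs before averaging
# [1] 17.5
```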

R Markdown: Short Activity 2–3 Details (BMI Statistics)

  • Code and results:

  framingham$BMI <- as.numeric(framingham$BMI)
  mean(framingham$BMI, na.rm = TRUE)
  median(framingham$BMI, na.rm = TRUE)
  sd(framingham$BMI, na.rm = TRUE)
  • Result example (from provided output): mean ≈ 25.8008, median ≈ 25.4, sd ≈ 4.07984

  • Interpretation: BMI is technically right-skewed because Mean > Median, but the skew is mild. The two values are close, in part because with a large N a few extreme values have limited influence on the mean.

R Markdown: Short Activity 4–6 (Visualizations and Diabetes)

  • Activity 4: Boxplot for BMI to assess outliers:

  boxplot(framingham$BMI, main = "BMI of Patients", horizontal = TRUE)
  • Conclusion: BMI has many large and extreme values; outliers are visible as points beyond the whiskers.

  • Activity 5: Histogram of BMI:

  hist(framingham$BMI, main = "BMI of Patients")
  • Conclusion: Right-skewed distribution with potential outliers.

  • Activity 6: Bar Plot for Diabetes:

  mytable <- table(framingham$diabetes)
  barplot(mytable)
  • Question: What percent have diabetes? Calculation: (frequency of diabetes) / (total observations) × 100.
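
  • Sketch (the Framingham file is not included in these notes, so the vector below is hypothetical, and coding diabetes as 0/1 is an assumption): turning a frequency table into percentages with prop.table().

```r
# Hypothetical indicator: 1 = has diabetes, 0 = does not
diabetes <- c(0, 0, 1, 0, 0, 0, 1, 0, 0, 0)

mytable <- table(diabetes)          # counts per category
pct <- prop.table(mytable) * 100    # counts divided by the total, as percentages
pct[["0"]]
# [1] 80
pct[["1"]]                          # percent with diabetes
# [1] 20
```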

Practical Takeaways and Connections

  • When describing public health datasets, always report both a location (central tendency) and a measure of spread (variability).

  • For skewed data or data with outliers, rely more on the median and IQR than the mean and standard deviation.

  • Visualizations (histograms, box plots, bar plots) are essential for understanding distribution shape, skewness, and outliers.

  • R is used throughout for computing statistics and generating visuals; familiarity with basic functions (mean, median, sd, IQR, boxplot, hist, barplot, table) is important.

Notable Formulas and Key References

  • Mean: \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

  • Range: \text{Range} = \max(x_i) - \min(x_i)

  • Variance (sample): s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}

  • Standard deviation: s = \sqrt{s^2}

  • Quartiles and IQR: \text{IQR} = Q_3 - Q_1

  • Outlier rule: values outside [Q_1 - 1.5 \cdot \text{IQR}, \ Q_3 + 1.5 \cdot \text{IQR}] are considered outliers.

  • Box plot five-number summary: Min, Q1, Median, Q3, Max.

  • Skewness interpretation:

    • Right-skew: Mean > Median; data stretched to the right.

    • Left-skew: Mean < Median; data stretched to the left.

    • Symmetric: Mean ≈ Median.

Session Note

  • Looking ahead: Dr. Michael Swain session in 3420 CCCB to discuss natural history of disease, NLM Encyclopedia, and Zotero for reference management.