Descriptive Biostatistics

Descriptive Biostatistics Lecture Notes

Lecture Outline

  • Descriptive Measures
    • Measures of Central Tendency
    • The Mean
    • The Median
    • The Mode
    • Data Distribution (symmetric and skewed distribution)
    • Measures of Dispersion
    • The Range
    • The Variance
    • The Standard Deviation
    • The Coefficient of Variation
    • The Percentiles
    • The Interquartile Range
    • Outliers
    • Kurtosis
    • Grouped Data: The Frequency Distribution
    • Graphic Methods

Descriptive Biostatistics

  • Importance of Data Organization:
    • Data should be summarized and organized effectively to facilitate analysis.
    • Raw data refers to measurements that have not been organized or summarized.

Descriptive Measures

  • Definition:
    • Descriptive measures summarize the data with a single number.
    • These measures can be derived from either sample data or population data.
    • When computed from a sample, they are called statistics; when from a population, they are termed parameters.

Types of Descriptive Measures

  • Two critical types of descriptive measures:
    1. Measures of Central Tendency
    2. Measures of Dispersion

Measures of Central Tendency

  • Definition:
    • A measure of central tendency is a single value that locates the center of a distribution.
    • Aim: To identify a score most representative of the entire group.
    • Common measures include:
    1. Mean
    2. Median
    3. Mode

The Mean

  • Types of Mean:
    1. Arithmetic Mean
    2. Geometric Mean
    3. Harmonic Mean
  • Arithmetic Mean:
    • Commonly referred to as the average, denoted by \overline{x}.
    • It is calculated as:
      \overline{x} = \frac{\sum_{i=1}^{n} x_i}{n}

The Sample Mean

  • A sample of 10 students' hours spent online:
    • Data: 20, 7, 12, 5, 33, 14, 8, 0, 19, 22
    • To calculate the sample mean:
      \overline{x} = \frac{20 + 7 + 12 + 5 + 33 + 14 + 8 + 0 + 19 + 22}{10} = \frac{140}{10} = 14 hours
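
The calculation above can be checked with a few lines of Python. This is a minimal sketch; the names `hours` and `sample_mean` are chosen here for illustration and do not come from the lecture.

```python
# Hours spent online by the 10 students in the sample above.
hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]

def sample_mean(values):
    """Arithmetic mean: the sum of the observations divided by their count."""
    return sum(values) / len(values)

mean_hours = sample_mean(hours)
print(mean_hours)  # 14.0
```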

Example - Birth-Weight Sample

  • Birth-weight data (g) of infants:
    • Calculation of the sample mean of a one-week sample:
      \overline{x} = \frac{\sum_{j=1}^{20} x_j}{20} = \frac{3265 + 3260 + 2834 + …}{20} = 3166.9 g

Limitations of the Mean

  • Sensitivity to Extreme Values:
    • The arithmetic mean may not represent the majority of sample points adequately due to its sensitivity to outliers.

Example of Mean Limitation

  • Suppose the first infant in the sample (3265 g) had instead been a premature infant weighing 500 g:
    • The sample mean would fall from 3166.9 g to 3028.7 g.
    • A single extreme value pulls the mean away from the weights of most infants in the sample.
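
The adjusted figure can be reproduced from numbers already quoted in these notes, under one assumption that is not stated explicitly in the source: the 500 g infant replaces the first listed weight (3265 g) in the 20-infant sample, whose mean was 3166.9 g. The sketch below rebuilds the sample total from the rounded mean, so the result is approximate.

```python
# Figures quoted in the birth-weight example (grams).
n = 20
reported_mean = 3166.9

# Assumption: the 500 g infant replaces the first listed weight of 3265 g.
total = reported_mean * n                  # approximate sum, rebuilt from the rounded mean
adjusted_mean = (total - 3265 + 500) / n

print(f"{adjusted_mean:.2f}")  # about 3028.65, in line with the quoted 3028.7 g
```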

Properties of the Mean

  1. Uniqueness:
    • Only one mean exists for a dataset.
  2. Simplicity:
    • Calculating the mean is straightforward.
  3. Affected by Extreme Values:
    • All data points influence the mean, making it easily distorted by outliers.

The Median

  • Definition:
    • The median divides the data into two equal halves.
  • Calculation Cases:
    • For an odd sample size: Median is the middle value.
    • For an even sample size: Median is the average of the two middle values.
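
The two cases can be sketched as a small function. The example reuses the hours-online sample from the section on the sample mean; the function name `median` is an illustrative choice.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                                 # odd sample size
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even sample size

hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]
print(median(hours))       # 13.0  (n = 10: average of 12 and 14)
print(median(hours[:-1]))  # 12    (n = 9: the 5th ordered value)
```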

Example of Median Calculation

  • Sample Ordered Set: 2069, 2581, …, 4146
    • With 20 values (even), median is calculated by averaging the 10th and 11th values:
      Median = \frac{3245 + 3248}{2} = 3246.5 g

Strengths of the Median

  • Insensitive to outliers, thus better represents data with extreme values compared to the mean.
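
A short comparison illustrates the point. It uses the hours-online sample from earlier in these notes and a hypothetical variant, invented here for illustration, in which the largest value (33) is replaced by an extreme 330.

```python
from statistics import fmean, median

hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]

# Hypothetical variant: the largest value (33) becomes an extreme 330.
hours_extreme = [330 if h == 33 else h for h in hours]

print(fmean(hours), median(hours))                  # 14.0 13.0
print(fmean(hours_extreme), median(hours_extreme))  # 43.7 13.0
```

The mean roughly triples, while the median does not move because the two middle values are unchanged.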

Weaknesses of the Median

  • Determined mainly by the middle value(s) of the ordered sample, so it is insensitive to the actual magnitudes of the remaining observations.

Data Distributions

  • Classified as symmetric or asymmetric:
    • Symmetric: Mirror images on either side of the center.
    • Asymmetric: Distribution is skewed.

Types of Skewness

  1. Positively Skewed Distribution:
    • Tail extends to the right. Mean > Median
    • Example: Years of Oral Contraceptive use.
  2. Negatively Skewed Distribution:
    • Tail extends to the left. Mean < Median
    • Example: Humidity observations.

Relationship between Mean and Median

  • Symmetric Distributions: Mean ≈ Median.
  • Positively skewed: Mean > Median.
  • Negatively skewed: Mean < Median.

The Mode

  • Definition:
    • The mode is the most frequently occurring value in a dataset.
  • Classifying distributions by mode: unimodal, bimodal, trimodal, etc.
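
A minimal sketch of finding the mode(s) by counting. The helper name `modes` and the small datasets are hypothetical; returning an empty list when every value is unique is a convention adopted here so that such data are reported as having no mode.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:          # all values unique: treated here as "no mode"
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([2, 3, 3, 5, 7]))   # [3]     unimodal
print(modes([1, 1, 2, 2, 3]))   # [1, 2]  bimodal
print(modes([4, 5, 6]))         # []      no mode
```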

Examples of Mode Calculation

  1. Mode in the White Blood Count example: Occurs at 8000.
  2. In a population of ages, mode was observed at age 53 occurring 17 times.
  3. In another age distribution, mode was absent since all values were unique.

Histograms Illustrating Skewness

  • Histograms visualize frequency distributions, making skewness easy to discern.
  • Distribution characteristics reviewed include:
    • No Skew (symmetric, unimodal): Mean = Median = Mode
    • Right Skew: Mean > Median
    • Left Skew: Mean < Median

Measures of Spread or Dispersion

  • Synonymous terms include variation, spread, and scatter.
  • A measure of dispersion conveys how variable the data are.

The Range

  • Definition:
    • The range is defined as R = x_L - x_S, where:
    • x_L = maximum (largest) value, x_S = minimum (smallest) value.
  • Limitations:
    • Sensitive to extreme observations and provides limited information.
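
As a quick illustration, the range of the hours-online sample from earlier in these notes:

```python
hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]

# Range = largest value minus smallest value.
sample_range = max(hours) - min(hours)
print(sample_range)  # 33  (33 - 0)
```

Only the two most extreme observations enter the calculation, which is why the range says little about the other values.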

The Variance (s²)

  • Definition:
    • Variance quantifies dispersion relative to the mean, represented as:
      s^2 = \frac{\sum_{i=1}^{n} (x_i - \overline{x})^2}{n - 1}
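
The formula can be followed step by step in code. This sketch applies it to the hours-online sample from earlier in these notes; the function name `sample_variance` is an illustrative choice.

```python
hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)

s2 = sample_variance(hours)
print(round(s2, 2))  # 94.67  (852 / 9)
```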

Degrees of Freedom

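  • The sample variance divides by n - 1 rather than n because the n deviations from the sample mean always sum to zero, so only n - 1 of them are free to vary; n - 1 is the number of degrees of freedom.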
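
One way to see where n - 1 comes from: the deviations from the sample mean always sum to zero, so once n - 1 of them are known the last one is fixed. A small check on the hours-online sample from earlier in these notes:

```python
hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]

mean = sum(hours) / len(hours)
deviations = [x - mean for x in hours]
print(sum(deviations))  # 0.0

# The first n - 1 deviations determine the last one.
last_from_others = -sum(deviations[:-1])
print(last_from_others == deviations[-1])  # True
```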

Standard Deviation (s)

  • Definition:
    • The standard deviation is the square root of the variance:
      s = \sqrt{s^2}
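
A minimal sketch using Python's standard `statistics` module, whose `variance` function uses the n - 1 divisor; the data are the hours-online sample from earlier in these notes.

```python
import math
from statistics import variance

hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]

s = math.sqrt(variance(hours))   # square root of the sample variance
print(round(s, 2))  # 9.73
```

Unlike the variance, s is expressed in the same units as the data (hours here).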

Coefficient of Variation (CV)

  • Represents the standard deviation as a percentage of the mean:
    CV = \frac{s}{\overline{x}} \times 100\%
  • Useful for comparing relative variation across different datasets.
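
The sketch below computes the CV for the hours-online sample from earlier in these notes and for the same data converted to minutes, a rescaled copy made up here for illustration. The helper name `coefficient_of_variation` is likewise an illustrative choice.

```python
from statistics import fmean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    mean = fmean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(values) / mean * 100

hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]
minutes = [h * 60 for h in hours]   # same data, different unit

cv_hours = coefficient_of_variation(hours)
cv_minutes = coefficient_of_variation(minutes)
print(round(cv_hours, 1))    # 69.5
print(round(cv_minutes, 1))  # 69.5
```

Because the unit cancels in the ratio, the CV is the same in hours and in minutes, which is what makes it suitable for comparing datasets measured on different scales.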

Percentiles

  • Definition:
    • Percentiles divide the ordered data into 100 equal parts.
    • The pth percentile, P_p, is a value with p% of the observations at or below it.
  • Advantages:
    • Compared with the range, percentiles are less sensitive to outliers and are not dramatically influenced by sample size.
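
Several conventions exist for computing percentiles from a finite sample, and these notes do not specify one. The sketch below uses `statistics.quantiles` from the Python standard library with its default "exclusive" interpolation method, which is an assumption rather than the lecture's own rule. The helper name `percentile` is hypothetical, and the data are the hours-online sample from earlier in these notes.

```python
from statistics import quantiles

hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]

# n=100 yields the 99 cut points P_1 ... P_99 (default method: "exclusive").
cut_points = quantiles(hours, n=100)

def percentile(p):
    """Return P_p for an integer p from 1 to 99."""
    if not 1 <= p <= 99:
        raise ValueError("p must be an integer from 1 to 99")
    return cut_points[p - 1]

print(percentile(25), percentile(50), percentile(75))  # 6.5 13.0 20.5
```

The 50th percentile (13.0) coincides with the median of the sample.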

Interquartile Range (IQR)

  • Defined as the difference between the third quartile (Q3) and the first quartile (Q1): IQR = Q3 - Q1
    • More informative regarding middle 50% variability compared to range.
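
A minimal sketch of the IQR for the hours-online sample from earlier in these notes. Quartiles are taken from `statistics.quantiles` with its default "exclusive" method; other quartile conventions can give slightly different values, so treat the numbers as one reasonable choice.

```python
from statistics import quantiles

hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]

q1, q2, q3 = quantiles(hours, n=4)   # quartile cut points
iqr = q3 - q1
print(q1, q3, iqr)               # 6.5 20.5 14.0
print(max(hours) - min(hours))   # 33  (the range, for comparison)
```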

Outliers or Outlying Values

  • Definition:
    • Values markedly higher or lower than the bulk of the distribution. A common rule flags x as an outlier when:
    • x > Q3 + 1.5(Q3 - Q1)
    • x < Q1 - 1.5(Q3 - Q1)
    • Example calculations provided based on upper and lower quartiles.
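
The two inequalities translate directly into code. In this sketch the function name `find_outliers` is hypothetical, the quartiles come from `statistics.quantiles` (default "exclusive" method, an assumption about the convention), and the extra value 90 is invented purely to trigger the rule.

```python
from statistics import quantiles

def find_outliers(values):
    """Values above Q3 + 1.5*IQR or below Q1 - 1.5*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in values if x < lower_fence or x > upper_fence]

hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]
print(find_outliers(hours))          # []    (fences at -14.5 and 41.5)
print(find_outliers(hours + [90]))   # [90]  (hypothetical extreme value)
```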

Kurtosis

  • Describes the flatness or peakedness of a distribution:
    • Mesokurtic: Normal distribution
    • Platykurtic: Flatter than normal
    • Leptokurtic: Taller and more peaked than normal

Data Representation Methods

  • Ordered Array:
    • Data arranged from smallest to largest.
  • Grouped Data:
    • Simplified overview via summary tables and frequency distributions.
  • Graphic Methods:
    • Examples: Histograms, Frequency Polygons, Stem-and-Leaf Displays, Box-and-Whisker Plots.
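
As an illustration of grouping, the sketch below builds a frequency distribution for the hours-online sample from earlier in these notes and prints a rough text histogram. The class width of 10 hours is an arbitrary choice made here, and the class limits assume whole-number data.

```python
from collections import Counter

hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]
WIDTH = 10  # hypothetical class width, in hours

# Map each value to the lower limit of its class, then count per class.
counts = Counter((h // WIDTH) * WIDTH for h in hours)

# One row per class: (lower limit, upper limit, frequency).
table = [(lower, lower + WIDTH - 1, counts[lower])
         for lower in range(0, max(hours) + 1, WIDTH)]

for lower, upper, freq in table:
    print(f"{lower:>2}-{upper:<2}  {freq}  {'*' * freq}")
```

The resulting classes 0-9, 10-19, 20-29, and 30-39 have frequencies 4, 3, 2, and 1.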