Descriptive Biostatistics

Descriptive Biostatistics Lecture Notes

Lecture Outline

Descriptive Measures
- Measures of Central Tendency
- The Mean
- The Median
- The Mode
- Data Distribution (symmetric and skewed distribution)
- Measures of Dispersion
- The Range
- The Variance
- The Standard Deviation
- The Coefficient of Variation
- The Percentiles
- The Interquartile Range
- Outliers
- Kurtosis
- Grouped Data: The Frequency Distribution
- Graphic Methods

Descriptive Biostatistics

Importance of Data Organization:
- Data should be summarized and organized effectively to facilitate analysis.
- Raw data refers to measurements that have not been organized or summarized.

Descriptive Measures

Definition:
- Descriptive measures summarize the data with a single number.
- These measures can be derived from either sample data or population data.
- When computed from a sample, they are called statistics; when from a population, they are termed parameters.

Types of Descriptive Measures

Two critical types of descriptive measures:
1. Measures of Central Tendency
2. Measures of Dispersion

Measures of Central Tendency

Definition:
- A measure of central tendency is a single value that identifies the center of a distribution.
- Aim: To identify a score most representative of the entire group.
- Common measures include:
1. Mean
2. Median
3. Mode

The Mean

Types of Mean:
1. Arithmetic Mean
2. Geometric Mean
3. Harmonic Mean
Arithmetic Mean:
- Commonly referred to as the average, denoted by $\overline{x}$.
- It is calculated as:
  $\overline{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

The Sample Mean

A sample of 10 students' hours spent online:
- Data: 20, 7, 12, 5, 33, 14, 8, 0, 19, 22
- To calculate the sample mean:
  $\overline{x} = \frac{20 + 7 + 12 + 5 + 33 + 14 + 8 + 0 + 19 + 22}{10} = \frac{140}{10} = 14$ hours
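The same calculation can be sketched in Python. The function name `sample_mean` is an illustrative choice, not something defined in the lecture.

```python
# Hours spent online by the 10 students in the sample.
hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]


def sample_mean(values):
    """Arithmetic mean: the sum of the observations divided by their count."""
    return sum(values) / len(values)


mean_hours = sample_mean(hours)
print(mean_hours)  # 14.0
```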

Example - Birth-Weight Sample

Birth-weight data (g) of infants:
- Calculation of the sample mean of a one-week sample:
  $\overline{x} = \frac{\sum_{j=1}^{20} x_j}{20} = \frac{3265 + 3260 + 2834 + \dots}{20} = 3166.9 \text{ g}$

Limitations of the Mean

Sensitivity to Extreme Values:
- The arithmetic mean may not represent the majority of sample points adequately due to its sensitivity to outliers.

Example of Mean Limitation

If the first infant in the sample had been premature, weighing 500 g instead of 3265 g:
- The sample mean would fall from 3166.9 g to 3028.7 g.
- A single extreme value therefore pulls the mean away from the typical birth weight, so it represents the sample poorly.
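The effect of one extreme value can be checked from the summary figures alone. The sketch below takes the reported sample mean of 3166.9 g for n = 20 infants and assumes the 500 g premature infant replaces the first listed weight of 3265 g; this interpretation is an assumption, chosen because it reproduces the 3028.7 g figure to within rounding of the reported mean.

```python
n = 20                   # infants in the one-week sample
original_mean = 3166.9   # reported sample mean (g)
replaced_weight = 3265   # first listed birth weight (g), assumed to be the one replaced
premature_weight = 500   # weight of the premature infant (g)

# Recover the sample total from the mean, swap one observation, and re-average.
original_total = original_mean * n
adjusted_mean = (original_total - replaced_weight + premature_weight) / n

print(f"adjusted mean is about {adjusted_mean:.2f} g")
print(f"the mean drops by about {original_mean - adjusted_mean:.2f} g")
```

One observation out of twenty shifts the mean by roughly 138 g, which is the sensitivity the example is meant to show.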

Properties of the Mean

Uniqueness:
- Only one mean exists for a dataset.
Simplicity:
- Calculating the mean is straightforward.
Affected by Extreme Values:
- All data points influence the mean, making it easily distorted by outliers.

The Median

Definition:
- The median is the middle value of the ordered data: half of the observations lie at or below it and half at or above it.
Calculation by Sample Size:
- For an odd sample size: Median is the middle value.
- For an even sample size: Median is the average of the two middle values.
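A minimal sketch of the odd/even rule, applied to the hours-online sample from earlier in the notes. The function name `median` and the nine-value subsample are illustrative.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]

median_even = median(hours)      # n = 10: average of the 5th and 6th ordered values (12 and 14)
median_odd = median(hours[:9])   # n = 9: the 5th ordered value

print(median_even)  # 13.0
print(median_odd)   # 12
```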

Example of Median Calculation

Sample Ordered Set: 2069, 2581, …, 4146
- With 20 values (even), median is calculated by averaging the 10th and 11th values:
  $Median = \frac{3245 + 3248}{2} = 3246.5 g$

Strengths of the Median

The median is insensitive to outliers, so it represents data with extreme values better than the mean does.

Weaknesses of the Median

The median is determined mainly by the middle one or two ordered values, so it uses little of the information in the magnitudes of the remaining observations.

Data Distributions

Classified as symmetric or asymmetric:
- Symmetric: Mirror images on either side of the center.
- Asymmetric: Distribution is skewed.

Types of Skewness

Positively Skewed Distribution:
- Tail extends to the right. Mean > Median
- Example: Years of Oral Contraceptive use.
Negatively Skewed Distribution:
- Tail extends to the left. Mean < Median
- Example: Humidity observations.

Relationship between Mean and Median

Symmetric Distributions: Mean ≈ Median.
Positively skewed: Mean > Median.
Negatively skewed: Mean < Median.

The Mode

Definition:
- The mode is the most frequently occurring value in a dataset.
Classifying distributions by mode: unimodal, bimodal, trimodal, etc.
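A short sketch of finding the mode or modes of a dataset. The function name `modes` and the small datasets are hypothetical, chosen to show the unimodal, bimodal, and no-mode cases. Treating a dataset in which every value is unique as having no mode is a convention assumed here.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list if every value occurs once."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # all values unique: no mode
    return sorted(v for v, c in counts.items() if c == top)


unimodal = modes([2, 3, 3, 4, 5])    # one mode
bimodal = modes([1, 1, 2, 3, 3])     # two modes
no_mode = modes([1, 2, 3])           # every value unique

print(unimodal, bimodal, no_mode)  # [3] [1, 3] []
```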

Examples of Mode Calculation

Mode in the White Blood Count example: Occurs at 8000.
In a population of ages, mode was observed at age 53 occurring 17 times.
In another age distribution, mode was absent since all values were unique.

Histograms Illustrating Skewness

Histograms visualize frequency distributions, which makes skewness easy to see.
Distribution characteristics reviewed include:
- No Skew: Mean = Median = Mode
- Right Skew: Mean > Median
- Left Skew: Mean < Median

Measures of Spread or Dispersion

Synonymous terms include variation, spread, and scatter.
A measure of dispersion conveys how much the values in a dataset vary.

The Range

Definition:
- The range is defined as: $R = x_L - x_S$ where:
- $x_L$ = maximum value, $x_S$ = minimum value.
Limitations:
- Sensitive to extreme observations and provides limited information.
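As a quick illustration, the range of the hours-online sample from earlier in the notes:

```python
hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]

# Range = largest value minus smallest value.
data_range = max(hours) - min(hours)
print(data_range)  # 33
```

Only two observations (0 and 33) enter the calculation, which is why the range says little about the rest of the data.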

The Variance (s²)

Definition:
- Variance quantifies dispersion relative to the mean, represented as:
  $s^2 = \frac{\sum_{i=1}^{n} (x_i - \overline{x})^2}{n - 1}$

Degrees of Freedom

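The sample variance divides by $n - 1$ instead of $n$ because the deviations $x_i - \overline{x}$ always sum to zero: once $n - 1$ of them are known, the last is fixed. The quantity $n - 1$ is called the degrees of freedom.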
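The sketch below computes the sample variance of the hours-online sample by hand and compares it with Python's standard `statistics.variance`, which also divides by n - 1. The function name `sample_variance` is illustrative.

```python
import statistics

hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


mean = sum(hours) / len(hours)
deviation_sum = sum(x - mean for x in hours)  # deviations from the mean sum to zero

s2 = sample_variance(hours)                   # 852 / 9
s2_library = statistics.variance(hours)       # same n - 1 divisor

print(deviation_sum)
print(round(s2, 2), round(s2_library, 2))
```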

Standard Deviation (s)

Definition:
- The standard deviation is the square root of the variance:
  $s = \sqrt{s^2}$
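Continuing with the hours-online sample, a minimal sketch of the standard deviation as the square root of the sample variance, checked against the standard library's `statistics.stdev`:

```python
import math
import statistics

hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]

n = len(hours)
mean = sum(hours) / n
variance = sum((x - mean) ** 2 for x in hours) / (n - 1)

s = math.sqrt(variance)               # standard deviation, in the same units as the data (hours)
s_library = statistics.stdev(hours)   # library equivalent

print(round(s, 2))  # 9.73
```

Unlike the variance, the standard deviation is expressed in the original units of measurement, which makes it easier to interpret.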

Coefficient of Variation (CV)

Represents the standard deviation as a percentage of the mean:
$CV = \frac{s}{\overline{x}} \times 100$
Useful for comparing relative variation across different datasets.
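A sketch of the CV for the hours-online sample. The function name `coefficient_of_variation` is illustrative.

```python
import statistics


def coefficient_of_variation(values):
    """Standard deviation expressed as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100


hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]
cv_hours = coefficient_of_variation(hours)

print(f"CV = {cv_hours:.1f}%")  # CV = 69.5%
```

Because the CV has no units, it can be compared across variables measured on different scales; it is only meaningful when the mean is not zero.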

Percentiles

Definition:
- Percentiles divide the ordered data into 100 equal parts. Compared with the range, they are less sensitive to outliers and are not dramatically influenced by sample size.
- The pth percentile, $P_p$, is the value with approximately p% of the observations below it.
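Several conventions exist for computing percentiles from a sample, and the notes do not specify one. The sketch below assumes the default ("exclusive") interpolation method of Python's `statistics.quantiles`; other textbooks and packages may give slightly different values. The helper name `percentile` is illustrative.

```python
import statistics

hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]


def percentile(values, p):
    """p-th percentile (p = 1..99) using statistics.quantiles with 100 groups."""
    cut_points = statistics.quantiles(values, n=100)  # 99 cut points: P1 ... P99
    return cut_points[p - 1]


p25 = percentile(hours, 25)
p50 = percentile(hours, 50)
p90 = percentile(hours, 90)

print(p25, p50, p90)
```

Under this convention the 50th percentile coincides with the median of the sample (13 hours).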

Interquartile Range (IQR)

Defined as the difference between the third quartile (Q3) and the first quartile (Q1): $IQR = Q3 - Q1$
- Describes the variability of the middle 50% of the observations and, unlike the range, is not affected by extreme values.
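A sketch of the IQR for the hours-online sample. The quartiles are computed with `statistics.quantiles` and its default interpolation method, which is an assumption; other quartile conventions can give slightly different values.

```python
import statistics

hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]

# n=4 returns the three quartile cut points Q1, Q2 (the median), and Q3.
q1, q2, q3 = statistics.quantiles(hours, n=4)
iqr = q3 - q1

print(q1, q3, iqr)  # 6.5 20.5 14.0
```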

Outliers or Outlying Values

Definition:
- Values much higher or lower than the bulk of the distribution. A common rule flags x as an outlier if:
- x > Q3 + 1.5(Q3 - Q1)
- x < Q1 - 1.5(Q3 - Q1)
- Example calculations provided based on upper and lower quartiles.
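The 1.5 × IQR rule can be sketched as follows. The function name `find_outliers` and the added value of 60 hours are hypothetical, and the quartiles use the default method of `statistics.quantiles`, which is an assumption about the quartile convention.

```python
import statistics


def find_outliers(values):
    """Return values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in values if x < lower_fence or x > upper_fence]


hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]

outliers_original = find_outliers(hours)           # fences at -14.5 and 41.5
outliers_extended = find_outliers(hours + [60])    # hypothetical extra student

print(outliers_original)  # []
print(outliers_extended)  # [60]
```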

Kurtosis

Describes the flatness or peakedness of a distribution:
- Mesokurtic: Normal distribution
- Platykurtic: Flatter than normal
- Leptokurtic: More peaked than normal
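The notes do not give a formula for kurtosis. One common moment-based measure, assumed here, is the excess kurtosis $m_4 / m_2^2 - 3$, where $m_2$ and $m_4$ are the second and fourth central moments; it is near 0 for a normal distribution, negative for flatter shapes, and positive for more peaked ones. The function name and both datasets are hypothetical.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3


flat = [1, 2, 3, 4, 5]                       # evenly spread values
peaked = [-5, 0, 0, 0, 0, 0, 0, 0, 0, 5]     # most values at the center, two far away

k_flat = excess_kurtosis(flat)       # negative: platykurtic
k_peaked = excess_kurtosis(peaked)   # positive: leptokurtic

print(round(k_flat, 2), round(k_peaked, 2))  # -1.3 2.0
```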

Data Representation Methods

Ordered Array:
- Data arranged from smallest to largest.
Grouped Data:
- Simplified overview via summary tables and frequency distributions.
Graphic Methods:
- Examples: Histograms, Frequency Polygons, Stem-and-Leaf Displays, Box-and-Whisker Plots.
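As a small illustration of these representations, the sketch below builds an ordered array and a frequency distribution for the hours-online sample, then prints a text histogram. The class width of 10 hours is an assumption chosen for the example.

```python
from collections import Counter

hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]
class_width = 10  # assumed class interval width (hours)

# Ordered array: the data arranged from smallest to largest.
ordered = sorted(hours)

# Frequency distribution: count the observations falling in each class interval.
counts = Counter((x // class_width) * class_width for x in hours)
table = [(lower, lower + class_width - 1, counts[lower]) for lower in sorted(counts)]

print(ordered)
for lower, upper, frequency in table:
    # One asterisk per observation gives a simple text histogram.
    print(f"{lower:>2}-{upper:<2} | {'*' * frequency} ({frequency})")
```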