Descriptive Biostatistics
Descriptive Biostatistics Lecture Notes
Lecture Outline
- Descriptive Measures
- Measures of Central Tendency
- The Mean
- The Median
- The Mode
- Data Distribution (symmetric and skewed distribution)
- Measures of Dispersion
- The Range
- The Variance
- The Standard Deviation
- The Coefficient of Variation
- The Percentiles
- The Interquartile Range
- Outliers
- Kurtosis
- Grouped Data: The Frequency Distribution
- Graphic Methods
Descriptive Biostatistics
- Importance of Data Organization:
- Data should be summarized and organized effectively to facilitate analysis.
- Raw data refers to measurements that have not been organized or summarized.
Descriptive Measures
- Definition:
- Descriptive measures summarize the data with a single number.
- These measures can be derived from either sample data or population data.
- When computed from a sample, they are called statistics; when from a population, they are termed parameters.
Types of Descriptive Measures
- Two critical types of descriptive measures:
- Measures of Central Tendency
- Measures of Dispersion
Measures of Central Tendency
- Definition:
- A measure of central tendency is a single value that locates the center of a distribution.
- Aim: To identify a score most representative of the entire group.
- Common measures include:
- Mean
- Median
- Mode
The Mean
- Types of Mean:
- Arithmetic Mean
- Geometric Mean
- Harmonic Mean
- Arithmetic Mean:
- Commonly referred to as the average, denoted by \overline{x}.
- It is calculated as:
\overline{x} = \frac{\sum_{i=1}^{n} x_i}{n}
The Sample Mean
- A sample of 10 students' hours spent online:
- Data: 20, 7, 12, 5, 33, 14, 8, 0, 19, 22
- To calculate the sample mean:
\overline{x} = \frac{20 + 7 + 12 + 5 + 33 + 14 + 8 + 0 + 19 + 22}{10} = \frac{140}{10} = 14 hours
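The calculation above can be checked with a short Python sketch. The function and variable names are illustrative and are not part of the original notes.

```python
# Hours spent online by the 10 students in the sample above.
hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]

def sample_mean(values):
    """Arithmetic mean: sum of the observations divided by their count."""
    return sum(values) / len(values)

print(sample_mean(hours))  # 14.0
```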
Example - Birth-Weight Sample
- Birth-weight data (g) of infants:
- Calculation of the sample mean of a one-week sample:
\overline{x} = \frac{\sum_{j=1}^{20} x_j}{20} = \frac{3265 + 3260 + 2834 + \dots}{20} = 3166.9 g
Limitations of the Mean
- Sensitivity to Extreme Values:
- The arithmetic mean may not represent the majority of sample points adequately due to its sensitivity to outliers.
Example of Mean Limitation
- Suppose the first infant in the sample had been premature, weighing 500 g instead of 3265 g:
- Original mean = 3166.9 g; adjusted mean with the 500 g infant = 3028.7 g.
- A single extreme value lowers the mean by about 138 g, so the mean no longer represents the typical infant well.
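The adjusted mean can be reproduced from the summary figures alone. The sketch below assumes that the 500 g infant replaces the first recorded weight of 3265 g and that the original mean is the 3166.9 g computed for the 20-infant sample; because that mean is itself rounded, the result agrees with 3028.7 g only up to rounding.

```python
n = 20                   # infants in the one-week sample
original_mean = 3166.9   # g, sample mean reported earlier (already rounded)
replaced = 3265          # g, value assumed to be swapped out
premature = 500          # g, weight of the premature infant

# Recover the total from the mean, swap one observation, and re-average.
total = n * original_mean
adjusted_mean = (total - replaced + premature) / n

print(f"adjusted mean is about {adjusted_mean:.1f} g")
print(f"drop in the mean is about {original_mean - adjusted_mean:.1f} g")
```

Changing one observation out of 20 moves the mean by roughly 138 g, which is the sensitivity the example is meant to show.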
Properties of the Mean
- Uniqueness:
- Only one mean exists for a dataset.
- Simplicity:
- Calculating the mean is straightforward.
- Affected by Extreme Values:
- All data points influence the mean, making it easily distorted by outliers.
The Median
- Definition:
- The median divides the data into two equal halves.
- Calculation:
- For an odd sample size: Median is the middle value.
- For an even sample size: Median is the average of the two middle values.
- Sample Ordered Set: 2069, 2581, …, 4146
- With 20 values (even), median is calculated by averaging the 10th and 11th values:
Median = \frac{3245 + 3248}{2} = 3246.5 g
- Strength: insensitive to outliers, so it represents data with extreme values better than the mean does.
- Weakness: determined mainly by the middle values, so it is less sensitive to the actual numeric values of the remaining data points.
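The odd/even rule can be sketched in Python. Because the full birth-weight list is not reproduced in these notes, the example uses the 10 students' hours from the sample mean section; the function name is illustrative.

```python
hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]

def median(values):
    """Middle value for odd n; average of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Ordered: 0, 5, 7, 8, 12, 14, 19, 20, 22, 33 (n = 10, even),
# so the median averages the 5th and 6th values, 12 and 14.
print(median(hours))  # 13.0
```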
Data Distributions
- Classified as symmetric or asymmetric:
- Symmetric: Mirror images on either side of the center.
- Asymmetric: Distribution is skewed.
Types of Skewness
- Positively Skewed Distribution:
- Tail extends to the right. Mean > Median
- Example: Years of Oral Contraceptive use.
- Negatively Skewed Distribution:
- Tail extends to the left. Mean < Median
- Example: Humidity observations.
- Symmetric Distributions: Mean ≈ Median.
- Positively skewed: Mean > Median.
- Negatively skewed: Mean < Median.
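As a rough illustration, the mean-versus-median comparison can be coded directly. This is only a rule-of-thumb sketch: the function name and the `tolerance` parameter are assumptions introduced here, and with a small sample the comparison suggests a direction of skew rather than proving it.

```python
import statistics

hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]

def skew_direction(values, tolerance=0.0):
    """Compare mean and median; `tolerance` (hypothetical) absorbs tiny gaps."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean - median > tolerance:
        return "positive"   # mean > median: tail to the right
    if median - mean > tolerance:
        return "negative"   # mean < median: tail to the left
    return "symmetric"      # mean and median approximately equal

print(skew_direction(hours))  # positive (mean 14 vs median 13)
```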
The Mode
- Definition:
- The mode is the most frequently occurring value in a dataset.
- Classifying distributions by mode: unimodal, bimodal, trimodal, etc.
Examples of Mode Calculation
- Mode in the White Blood Count example: Occurs at 8000.
- In a population of ages, mode was observed at age 53 occurring 17 times.
- In another age distribution, mode was absent since all values were unique.
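A minimal sketch of finding the mode, including the "no mode" case where every value is unique. The white blood count values below are invented for illustration only (the original data are not reproduced in these notes), and the function name is an assumption.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values, or [] when no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # every value is unique: no mode
    return sorted(v for v, c in counts.items() if c == top)

# Hypothetical white blood counts, made up for this example.
hypothetical_wbc = [7000, 8000, 8000, 9000, 8000, 6000]
hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]

print(modes(hypothetical_wbc))  # [8000]  (unimodal)
print(modes(hours))             # []      (all values unique)
print(modes([1, 1, 2, 2, 3]))   # [1, 2]  (bimodal)
```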
Histograms Illustrating Skewness
- A histogram visualizes a frequency distribution, which makes skewness easy to see.
- Distribution characteristics reviewed include:
- No Skew (symmetric, unimodal): Mean = Median = Mode
- Right Skew: Mean > Median
- Left Skew: Mean < Median
Measures of Spread or Dispersion
- Synonymous terms include variation, spread, and scatter.
- A measure of dispersion conveys how variable the data are.
The Range
- Definition:
- The range is defined as: R = x_L - x_S where:
- x_L = maximum value, x_S = minimum value.
- Limitations:
- Sensitive to extreme observations and provides limited information.
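For the 10 students' hours used in the sample mean section, the range can be computed as follows (the function name is illustrative):

```python
hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]

def sample_range(values):
    """Range R = largest value minus smallest value."""
    return max(values) - min(values)

print(sample_range(hours))  # 33 (maximum 33, minimum 0)
```

Only two observations enter the calculation, which is why the range is so sensitive to extreme values.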
The Variance (s²)
- Definition:
- Variance quantifies dispersion relative to the mean, represented as:
s^2 = \frac{\sum_{i=1}^{n} (x_i - \overline{x})^2}{n - 1}
Degrees of Freedom
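- The sample variance divides by n - 1 rather than n because the deviations from the sample mean must sum to zero, so only n - 1 of them are free to vary (n - 1 degrees of freedom).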
Standard Deviation (s)
- Definition:
- The standard deviation is the square root of the variance:
s = \sqrt{s^2}
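The variance and standard deviation formulas can be applied step by step to the 10 students' hours from the sample mean section. This is a sketch with illustrative function names; the standard library's `statistics.variance` and `statistics.stdev` use the same n - 1 divisor.

```python
import math

hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)

def sample_sd(values):
    """Standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(values))

variance = sample_variance(hours)  # 852 / 9
sd = sample_sd(hours)
print(round(variance, 2), round(sd, 2))  # 94.67 9.73
```

The variance is in squared units (hours squared), while the standard deviation is back in the original units (hours).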
Coefficient of Variation (CV)
- Represents the standard deviation as a percentage of the mean:
CV = \frac{s}{\overline{x}} \times 100
- Useful for comparing relative variation across different datasets.
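A short sketch of the CV for the 10 students' hours from the sample mean section, using the standard library's `statistics.stdev` (n - 1 divisor) and `statistics.mean`. The function name is illustrative. The CV is undefined when the mean is zero, in which case this code raises `ZeroDivisionError`.

```python
import statistics

hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]

def coefficient_of_variation(values):
    """Standard deviation expressed as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100

cv = coefficient_of_variation(hours)
print(round(cv, 1))  # 69.5 (percent)
```

Because the CV is unit-free, it can be compared across variables measured on different scales, for example weights in grams and heights in centimeters.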
Percentiles
- Definition:
- Percentiles divide the ordered data into 100 equal parts. They are less sensitive to outliers than the range.
- Percentiles are not dramatically influenced by sample size.
- The pth percentile, P_p, is the value at or below which p percent of the observations fall.
Interquartile Range (IQR)
- Defined as the difference between the third quartile (Q3) and the first quartile (Q1):
IQR = Q3 - Q1
- More informative regarding middle 50% variability compared to range.
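Quartiles and the IQR can be sketched with `statistics.quantiles` (Python 3.8 or later). Textbooks and software use several slightly different conventions for locating percentiles, so the values below reflect that function's default "exclusive" method and may differ a little from a hand calculation that follows another rule. The data are the 10 students' hours from the sample mean section.

```python
import statistics

hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]

# n=4 splits the data into quartiles and returns the three cut points.
q1, q2, q3 = statistics.quantiles(hours, n=4)
iqr = q3 - q1

print(q1, q2, q3)  # 6.5 13.0 20.5
print(iqr)         # 14.0
```

The second quartile, 13.0, matches the median of the same data, as it should.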
Outliers or Outlying Values
- Definition:
- Values markedly higher or lower than the rest of the distribution. A common rule flags x as an outlier if:
- x > Q3 + 1.5(Q3 - Q1)
- x < Q1 - 1.5(Q3 - Q1)
- Example calculations provided based on upper and lower quartiles.
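The 1.5 × IQR rule can be sketched as follows. The quartiles come from `statistics.quantiles` with its default method, which is one convention among several, and the extra value of 60 hours is hypothetical, added only to show a flagged observation. The function name is illustrative.

```python
import statistics

def find_outliers(values):
    """Return values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return [x for x in values if x < lower or x > upper]

hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]

print(find_outliers(hours))         # []   (limits are -14.5 and 41.5)
print(find_outliers(hours + [60]))  # [60] (hypothetical extra value)
```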
Kurtosis
- Describes the flatness or peakedness of a distribution:
- Mesokurtic: Normal distribution
- Platykurtic: Flatter than normal
- Leptokurtic: More peaked than normal
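These notes do not give a formula for kurtosis. One common choice, assumed here, is the moment-based excess kurtosis m4 / m2^2 - 3, which is 0 for a normal distribution, negative for flatter shapes, and positive for more peaked ones. The `tolerance` cutoff in the sketch is arbitrary, and with only 10 observations the classification is a rough indication at best.

```python
hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]

def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (an assumed definition)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

def classify_kurtosis(values, tolerance=0.1):
    """Label the shape; `tolerance` is a hypothetical cutoff around zero."""
    k = excess_kurtosis(values)
    if k > tolerance:
        return "leptokurtic"
    if k < -tolerance:
        return "platykurtic"
    return "mesokurtic"

print(round(excess_kurtosis(hours), 2))  # -0.45
print(classify_kurtosis(hours))          # platykurtic
```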
Data Representation Methods
- Ordered Array:
- Data arranged from smallest to largest.
- Grouped Data:
- Simplified overview via summary tables and frequency distributions.
- Graphic Methods:
- Examples: Histograms, Frequency Polygons, Stem-and-Leaf Displays, Box-and-Whisker Plots.
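As a sketch of grouping raw data into a frequency distribution, the 10 students' hours from the sample mean section can be tallied into class intervals and printed as a simple text histogram. The class width of 10 hours is an arbitrary choice made for this example, and the function name is illustrative.

```python
from collections import Counter

hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]
width = 10  # class width in hours (assumed for illustration)

def frequency_distribution(values, width):
    """Count observations per class interval, keyed by the interval's lower limit."""
    counts = Counter((x // width) * width for x in values)
    return {start: counts[start] for start in sorted(counts)}

table = frequency_distribution(hours, width)

# One row per class interval, with a bar of asterisks as a crude histogram.
for start, count in table.items():
    print(f"{start:>2}-{start + width - 1:<2} | {'*' * count} ({count})")
```

The resulting table is {0: 4, 10: 3, 20: 2, 30: 1}: frequencies fall off toward the higher classes, the right-tailed pattern described under positive skew. Empty classes would simply be omitted by this sketch.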