Descriptive Biostatistics
Descriptive Biostatistics Lecture Notes
Lecture Outline
- Descriptive Measures
- Measures of Central Tendency
- The Mean
- The Median
- The Mode
- Data Distribution (symmetric and skewed distribution)
- Measures of Dispersion
- The Range
- The Variance
- The Standard Deviation
- The Coefficient of Variation
- The Percentiles
- The Interquartile Range
- Outliers
- Kurtosis
- Grouped Data: The Frequency Distribution
- Graphic Methods
Descriptive Biostatistics
- Importance of Data Organization:
- Data should be summarized and organized effectively to facilitate analysis.
- Raw data refers to measurements that have not been organized or summarized.
Descriptive Measures
- Definition:
- Descriptive measures summarize the data with a single number.
- These measures can be derived from either sample data or population data.
- When computed from a sample, they are called statistics; when from a population, they are termed parameters.
Types of Descriptive Measures
- Two critical types of descriptive measures:
- Measures of Central Tendency
- Measures of Dispersion
Measures of Central Tendency
- Definition:
- A measure of central tendency is a single value that identifies the center of a distribution.
- Aim: To identify a score most representative of the entire group.
- Common measures include:
- Mean
- Median
- Mode
The Mean
- Types of Mean:
- Arithmetic Mean
- Geometric Mean
- Harmonic Mean
- Arithmetic Mean:
- Commonly referred to as the average, denoted by $\bar{x}$.
- It is calculated as:
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
The Sample Mean
- A sample of 10 students' hours spent online:
- Data: 20, 7, 12, 5, 33, 14, 8, 0, 19, 22
- To calculate the sample mean:
$\bar{x} = \frac{20+7+12+5+33+14+8+0+19+22}{10} = \frac{140}{10} = 14$ hours
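The calculation above can be checked in a few lines. This is a minimal Python sketch; the names `hours` and `sample_mean` are illustrative and do not come from the lecture.

```python
# Hours spent online by the 10 sampled students (data from the example above)
hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]

# Arithmetic mean: sum of the observations divided by the sample size
sample_mean = sum(hours) / len(hours)

print(sample_mean)  # 14.0
```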
Example - Birth-Weight Sample
- Birth-weight data (g) of infants:
- Calculation of the sample mean of a one-week sample:
$\bar{x} = \frac{\sum_{j=1}^{20} x_j}{20} = \frac{3265 + 3260 + 2834 + \dots}{20} = 3166.9\ \text{g}$
Limitations of the Mean
- Sensitivity to Extreme Values:
- The arithmetic mean may not represent the majority of sample points adequately due to its sensitivity to outliers.
Example of Mean Limitation
- If the 3265 g observation were replaced by a premature infant weighing 500 g:
- The sample mean would fall from 3166.9 g to 3028.7 g.
- A single extreme value therefore pulls the mean away from the bulk of the sample.
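The figures in this example are consistent with swapping one 3265 g observation for a 500 g observation in the 20-infant sample whose mean is 3166.9 g. The sketch below works through that arithmetic under exactly that assumption; because the full data table is not reproduced here, the total is recovered from the rounded mean, so the result is approximate.

```python
n = 20                  # sample size from the birth-weight example
original_mean = 3166.9  # reported sample mean in grams (rounded to one decimal)
replaced_value = 3265   # weight (g) assumed to be swapped out
premature_value = 500   # weight (g) of the hypothetical premature infant

# Recover the approximate total, swap one observation, and re-average
original_total = original_mean * n
adjusted_mean = (original_total - replaced_value + premature_value) / n

print(f"{adjusted_mean:.2f}")  # 3028.65, in line with the 3028.7 g quoted in the notes
```

One changed observation out of 20 moves the mean by about 138 g.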
Properties of the Mean
- Uniqueness:
- Only one mean exists for a dataset.
- Simplicity:
- Calculating the mean is straightforward.
- Affected by Extreme Values:
- All data points influence the mean, making it easily distorted by outliers.
The Median
- Definition:
- The median divides the ordered data into two equal halves.
- Calculation:
- For an odd sample size: Median is the middle value.
- For an even sample size: Median is the average of the two middle values.
- Sample Ordered Set: 2069, 2581, …, 4146
- With 20 values (even), median is calculated by averaging the 10th and 11th values:
$\text{Median} = \frac{3245 + 3248}{2} = 3246.5\ \text{g}$
- Strength: insensitive to outliers, so it represents data with extreme values better than the mean does.
- Weakness: determined mainly by the middle of the sample, so it is less sensitive to the actual values of the remaining observations.
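The odd and even cases can be sketched as a small Python function. The function name `sample_median` is illustrative, the data are the hours-online sample from earlier in these notes, and a non-empty input is assumed.

```python
from statistics import median

def sample_median(values):
    """Middle value for odd n; average of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]      # even n = 10
print(sample_median(hours))       # 13.0, the average of 12 and 14
print(sample_median(hours[:9]))   # 12, the 5th of 9 ordered values

# Replacing the largest value with an extreme one leaves the median unchanged
extreme = [330 if x == 33 else x for x in hours]
print(sample_median(extreme))     # 13.0
```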
Data Distributions
- Classified as symmetric or asymmetric:
- Symmetric: Mirror images on either side of the center.
- Asymmetric: Distribution is skewed.
Types of Skewness
- Positively Skewed Distribution:
- Tail extends to the right. Mean > Median
- Example: Years of Oral Contraceptive use.
- Negatively Skewed Distribution:
- Tail extends to the left. Mean < Median
- Example: Humidity observations.
- Symmetric Distributions: Mean ≈ Median.
- Positively skewed: Mean > Median.
- Negatively skewed: Mean < Median.
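The mean-versus-median comparison can be turned into a rough check. The sketch below is only a rule of thumb, not a formal skewness coefficient; the function name, the `tolerance` parameter, and all three data sets are hypothetical.

```python
from statistics import mean, median

def skew_direction(values, tolerance=0.0):
    """Rough skew indicator based only on comparing the mean with the median."""
    difference = mean(values) - median(values)
    if difference > tolerance:
        return "positive (right) skew"
    if difference < -tolerance:
        return "negative (left) skew"
    return "approximately symmetric"

right_tailed = [1, 2, 2, 3, 3, 4, 20]       # hypothetical: long right tail
left_tailed = [1, 17, 18, 18, 19, 19, 20]   # hypothetical: long left tail
symmetric = [1, 2, 3, 4, 5]                 # hypothetical: mirror image about 3

print(skew_direction(right_tailed))  # positive (right) skew  (mean 5 > median 3)
print(skew_direction(left_tailed))   # negative (left) skew   (mean 16 < median 18)
print(skew_direction(symmetric))     # approximately symmetric
```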
The Mode
- Definition:
- The mode is the most frequently occurring value in a dataset.
- Classifying distributions by mode: unimodal, bimodal, trimodal, etc.
Examples of Mode Calculation
- Mode in the White Blood Count example: Occurs at 8000.
- In a population of ages, mode was observed at age 53 occurring 17 times.
- In another age distribution, mode was absent since all values were unique.
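The three situations above (one mode, several modes, no mode) can be sketched with a frequency count. The function name `modes` and the sample values are hypothetical, and the convention that a data set of all-unique values has no mode follows the last example; a non-empty input is assumed.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values, or an empty list if every value occurs once."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # all values unique: no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([53, 41, 53, 60, 53, 41]))  # [53]    unimodal
print(modes([2, 2, 5, 5, 9]))           # [2, 5]  bimodal
print(modes([31, 42, 57, 64]))          # []      no mode
```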
Histograms Illustrating Skewness
- Visualizes frequency distributions to discern skewness.
- Distribution characteristics reviewed include:
- No Skew: Mean = Median = Mode
- Right Skew: Mean > Median
- Left Skew: Mean < Median
Measures of Spread or Dispersion
- Synonymous terms include variation, spread, and scatter.
- A measure of dispersion conveys how variable the data are.
The Range
- Definition:
- The range is defined as $R = x_L - x_S$, where:
- $x_L$ = maximum value, $x_S$ = minimum value.
- Limitations:
- Sensitive to extreme observations and provides limited information.
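A short sketch of the range and its sensitivity to a single extreme observation, using the hours-online sample from earlier in these notes. The function name `sample_range` and the added value of 100 are illustrative.

```python
def sample_range(values):
    """Range R = largest value minus smallest value."""
    return max(values) - min(values)

hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]

print(sample_range(hours))          # 33
print(sample_range(hours + [100]))  # 100: one extreme observation triples R
```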
The Variance (s²)
- Definition:
- Variance quantifies dispersion relative to the mean, represented as:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
Degrees of Freedom
- The sample variance divides by $n-1$ rather than $n$ because the deviations from the sample mean must sum to zero, so only $n-1$ of them are free to vary.
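The sample variance and the sum-to-zero constraint behind the $n-1$ divisor can be checked numerically. This is a minimal sketch using the hours-online sample from earlier in these notes; `sample_variance` is an illustrative name, and the standard-library `statistics.variance` is used only as a cross-check.

```python
from statistics import variance

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    squared_deviations = [(x - x_bar) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)

hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]
x_bar = sum(hours) / len(hours)

# The deviations always sum to zero, so only n - 1 of them carry information
print(sum(x - x_bar for x in hours))      # 0.0

print(round(sample_variance(hours), 2))   # 94.67 (that is, 852 / 9)
```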
Standard Deviation (s)
- Definition:
- The standard deviation is the square root of the variance:
$s = \sqrt{s^2}$
Coefficient of Variation (CV)
- Represents the standard deviation as a percentage of the mean:
$CV = \frac{s}{\bar{x}} \times 100\%$
- Useful for comparing relative variation across different datasets.
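As a worked illustration, the standard deviation and coefficient of variation of the hours-online sample from earlier in these notes can be computed as follows. The variable names are illustrative; the CV is only meaningful when the mean is positive.

```python
from math import sqrt
from statistics import mean, variance

hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]

s = sqrt(variance(hours))    # standard deviation: square root of the sample variance
cv = s / mean(hours) * 100   # coefficient of variation, as a percentage of the mean

print(round(s, 2))   # 9.73
print(round(cv, 1))  # 69.5
```

A CV near 70% indicates that the spread is large relative to the mean of 14 hours.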
Percentiles
- Definition:
- Percentiles divide the ordered data into 100 equal parts and are less sensitive to outliers than the range.
- Unlike the range, percentiles are not greatly affected by sample size.
- The $p$th percentile, $P_p$, is the value such that $p$% of the observations fall at or below it.
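Several conventions exist for computing a percentile from a finite sample, and the notes do not say which one is intended. The sketch below assumes one common textbook rule: with $k = np/100$, take the $(\lfloor k \rfloor + 1)$th smallest value when $k$ is not an integer, and average the $k$th and $(k+1)$th smallest values when it is. The function name `percentile` is illustrative, and a non-empty input is assumed.

```python
def percentile(values, p):
    """p-th percentile under one common textbook convention (an assumption here)."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    k, remainder = divmod(n * p, 100)
    k = int(k)
    if remainder != 0:
        return ordered[k]                     # 0-based index k is the (k+1)-th smallest
    return (ordered[k - 1] + ordered[k]) / 2  # average of the k-th and (k+1)-th smallest

hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]

print(percentile(hours, 25))  # 7     (first quartile under this convention)
print(percentile(hours, 50))  # 13.0  (the median)
print(percentile(hours, 75))  # 20    (third quartile under this convention)
```

Other software may interpolate between neighbouring values and return slightly different quartiles for the same data.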
Interquartile Range (IQR)
- Defined as the difference between the third quartile (Q3) and the first quartile (Q1):
IQR=Q3−Q1
- More informative regarding middle 50% variability compared to range.
Outliers or Outlying Values
- Definition:
- Values far above or below the bulk of the distribution. A value x is flagged as an outlier if:
- x > Q3 + 1.5(Q3 - Q1)
- x < Q1 - 1.5(Q3 - Q1)
- Example calculations provided based on upper and lower quartiles.
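The interquartile range and the 1.5 × IQR rule can be sketched as below. The quartiles are passed in as arguments because their exact values depend on the percentile convention used; `q1 = 7` and `q3 = 20` are assumed values for the hours-online sample, and the added observation of 60 is hypothetical.

```python
def outlier_fences(q1, q3):
    """Lower and upper cut-offs: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values, q1, q3):
    """Return the values lying outside the fences."""
    lower, upper = outlier_fences(q1, q3)
    return [x for x in values if x < lower or x > upper]

hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]
q1, q3 = 7, 20   # assumed quartiles; IQR = 13

print(outlier_fences(q1, q3))               # (-12.5, 39.5)
print(find_outliers(hours, q1, q3))         # []
print(find_outliers(hours + [60], q1, q3))  # [60]
```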
Kurtosis
- Describes the flatness or peakedness of a distribution:
- Mesokurtic: Normal distribution
- Platykurtic: Flatter than normal
- Leptokurtic: More peaked (taller) than normal
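The notes give no formula for kurtosis. One common moment-based choice, assumed here, is excess kurtosis $m_4 / m_2^2 - 3$, where $m_2$ and $m_4$ are the second and fourth central moments; it is 0 for a normal distribution. The function names, the `tolerance` cut-off of 0.5, and both data sets are hypothetical, and estimates from small samples are noisy.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (0 for a normal distribution)."""
    n = len(values)
    x_bar = sum(values) / n
    m2 = sum((x - x_bar) ** 2 for x in values) / n
    m4 = sum((x - x_bar) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3

def kurtosis_label(values, tolerance=0.5):
    """Classify by the sign of the excess kurtosis; the tolerance is an arbitrary choice."""
    g2 = excess_kurtosis(values)
    if g2 > tolerance:
        return "leptokurtic"
    if g2 < -tolerance:
        return "platykurtic"
    return "mesokurtic"

flat = [1, 2, 3, 4, 5]                        # hypothetical: evenly spread values
peaked = [0, 0, 0, 0, 0, 0, 0, 0, -10, 10]    # hypothetical: sharp centre, heavy tails

print(kurtosis_label(flat))    # platykurtic (excess kurtosis about -1.3)
print(kurtosis_label(peaked))  # leptokurtic (excess kurtosis 2.0)
```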
Data Representation Methods
- Ordered Array:
- Data arranged from smallest to largest.
- Grouped Data:
- Simplified overview via summary tables and frequency distributions.
- Graphic Methods:
- Examples: Histograms, Frequency Polygons, Stem-and-Leaf Displays, Box-and-Whisker Plots.
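As an illustration of grouped data, the sketch below builds a frequency distribution with equal-width class intervals and prints a crude text histogram. The function name, the class width of 10, and the starting point of 0 are assumptions chosen for the hours-online sample; classes with no observations are omitted from the table.

```python
from collections import Counter

def frequency_distribution(values, width, start=0):
    """Count observations in equal-width class intervals [lower, lower + width)."""
    counts = Counter((x - start) // width for x in values)
    table = []
    for index in sorted(counts):
        lower = start + index * width
        table.append((lower, lower + width, counts[index]))
    return table

hours = [20, 7, 12, 5, 33, 14, 8, 0, 19, 22]
ordered_array = sorted(hours)   # ordered array: smallest to largest

# One row per class interval, with one asterisk per observation
for lower, upper, count in frequency_distribution(hours, width=10):
    print(f"{lower} to <{upper}: {'*' * count} ({count})")
```

For this sample the classes 0 to <10, 10 to <20, 20 to <30, and 30 to <40 contain 4, 3, 2, and 1 observations respectively.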