Describing Variation & Distribution of Data
Variable: a measure of a single characteristic that can vary
Biologic Differences
Genes
Nutrition
Environmental exposures
Age
Sex
Race
Presence or absence of disease and extent of disease
Ex. Cancer of the cervix may be in situ, localized, invasive, or metastatic
Different conditions of measurement
Often account for the variations observed in medical data
Factors: time of the day, ambient temperature or noise, and the presence of fatigue or anxiety in the patient
Different techniques of measurement
Can produce different results
Measurement error
Can also cause variation
Systematic Error
Can distort data systematically in one direction.
Can introduce bias
Random Error
Does not introduce bias
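The difference between the two kinds of error can be sketched with a small simulation. Everything here is assumed for illustration: the true value of 120 mm Hg, the noise level, and the constant +5 offset standing in for a miscalibrated instrument.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

TRUE_VALUE = 120.0  # hypothetical true systolic pressure, mm Hg
N = 10_000          # number of simulated readings

# Random error: noise centered on zero, so readings scatter in both directions.
random_only = [TRUE_VALUE + random.gauss(0, 2) for _ in range(N)]

# Systematic error: the same kind of noise plus a constant +5 offset.
with_offset = [TRUE_VALUE + 5 + random.gauss(0, 2) for _ in range(N)]

mean_random = statistics.fmean(random_only)
mean_offset = statistics.fmean(with_offset)

print(f"random error only: mean = {mean_random:.2f}")
print(f"with +5 offset:    mean = {mean_offset:.2f}")
```

With many readings, the random errors largely cancel and the mean stays near the true value, while the offset shifts the mean by about the size of the offset.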
Numbers and measurement
Generally use words
Nominal Variables
Naming or categoric variables that are not based on measurement scales or rank order.
Ex. Blood groups, occupations, skin color
Dichotomous (Binary) Variables
Variables with only two levels
Ex. Study of heart murmurs (systolic or diastolic)
Ordinal (Ranked) Variables
Data that can be characterized in terms of three or more qualitative values that have a clear rank order
Ex. Satisfaction of care
Continuous (Dimensional) Variables
Measured on continuous scales
Observation differs over time
Ex. Height, Weight, Blood pressure
Ratio Variables
A continuous scale that has a true zero point
Ex. Kelvin Temperature
Frequency Distributions of Continuous Variables
Can be shown by creating a table that lists the values of the variable according to the frequency with which the value occurs.
Range of a variable
Range is the distance between the lowest and highest observations of the variable.
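A minimal sketch of a frequency table and the range, using a made-up sample of cholesterol readings (the values are hypothetical, chosen only to show the idea):

```python
from collections import Counter

# Hypothetical sample: total cholesterol rounded to the nearest 10 mg/dL.
values = [180, 190, 200, 200, 210, 190, 200, 220, 180, 200]

# Frequency table: each distinct value with the number of times it occurs.
freq = Counter(values)
for value in sorted(freq):
    print(f"{value:>5}  {freq[value]}")

# Range: distance between the lowest and highest observations.
value_range = max(values) - min(values)
print("range:", value_range)
```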
Real and Theoretical Frequency Distributions
Real Frequency Distributions
Obtained from actual data or sample
Theoretical Frequency Distributions
Calculated using assumptions about the population from which the sample was obtained
Normal Distribution
Also called the Gaussian distribution, after Johann Carl Friedrich Gauss
Bell-shaped curve
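As an illustration of a theoretical distribution, the sketch below uses Python's `statistics.NormalDist` to ask what share of a normal population falls within one and two standard deviations of the mean. The mean of 100 and standard deviation of 15 are assumed values, not taken from these notes.

```python
from statistics import NormalDist

# Hypothetical population parameters, chosen only for illustration.
dist = NormalDist(mu=100, sigma=15)

# Share of the population within 1 and 2 standard deviations of the mean.
within_1sd = dist.cdf(115) - dist.cdf(85)
within_2sd = dist.cdf(130) - dist.cdf(70)

print(f"within 1 SD: {within_1sd:.1%}")
print(f"within 2 SD: {within_2sd:.1%}")
```

The results are close to the familiar figures of roughly 68% and 95%, whatever mean and standard deviation are chosen.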
Parameters of a Frequency Distribution
Measures of Central Tendency
Mean (x̄) – Average value
Median – Middlemost or halfway value
Mode – Most frequent value
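The three measures can be computed with Python's standard `statistics` module. The sample below is hypothetical; it is slightly right-skewed so the three values differ.

```python
import statistics

# Hypothetical sample, e.g. number of clinic visits per patient.
data = [2, 3, 3, 4, 5, 7, 11]

mean = statistics.mean(data)      # sum of values divided by their count
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequent value

print("mean:", mean)
print("median:", median)
print("mode:", mode)
```

Here the single large value (11) pulls the mean above the median, which in turn sits above the mode.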
Measures of Dispersion
Based on Percentiles
Percentile of Distribution
The point below which a certain percentage of the observations lie when all the observations are ranked in order.
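A small sketch of the idea, assuming a hypothetical helper `percent_below` and a made-up sample of 20 scores. It reports what percentage of observations fall below a chosen point, which is the percentile rank of that point.

```python
def percent_below(observations, point):
    """Return the percentage of observations strictly below `point`."""
    below = sum(1 for x in observations if x < point)
    return 100 * below / len(observations)

# Hypothetical sample: 20 scores, here simply 1 through 20.
scores = list(range(1, 21))

pct = percent_below(scores, 16)
print(f"{pct:.0f}% of scores lie below 16")
```

In this sample, 15 of the 20 scores are below 16, so 16 sits at the 75th percentile.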
Based on Mean
Mean Absolute Deviation
Seldom used, but helps define the concept of dispersion
Lacks the mathematical properties needed to serve as the basis for many statistical tests
Variance
Fundamental measure of dispersion
Standard Deviation
Square root of the variance
Used to describe the amount of spread in the frequency distribution
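Roughly the typical distance of observations from the mean (the signed deviations themselves always average to zero, which is why they are squared first)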
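The mean-based measures can be worked through on a small hypothetical sample. The sketch treats the data as a whole population (dividing by N) and also shows the sample variance (dividing by N - 1) for comparison.

```python
import statistics

# Hypothetical data set with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.fmean(data)
deviations = [x - mean for x in data]

# Signed deviations cancel out, so their average is zero.
mean_deviation = sum(deviations) / len(data)

# Mean absolute deviation: average of the absolute deviations.
mad = sum(abs(d) for d in deviations) / len(data)

# Variance: average of the squared deviations (population form, divides by N).
variance = statistics.pvariance(data)

# Standard deviation: square root of the variance.
sd = statistics.pstdev(data)

# Sample variance divides by N - 1 instead of N, so it is slightly larger.
sample_variance = statistics.variance(data)

print("mean absolute deviation:", mad)
print("population variance:", variance)
print("population SD:", sd)
print("sample variance:", round(sample_variance, 3))
```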
Skewness
A horizontal stretching of a frequency distribution to one side or the other, so that one tail of observations is longer and has more observations than the other tail
Skewed to the left
When a histogram or frequency polygon has a longer tail on the left side of the diagram
Negatively skewed distribution
Skewed to the right
When a histogram or frequency polygon has a longer tail on the right side of the diagram
Positively skewed distribution
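The sign of skewness can be checked numerically. The sketch below uses one common definition, the moment-based coefficient (third central moment divided by the second central moment raised to the power 1.5). Other definitions exist, and both data sets are hypothetical.

```python
def skewness(data):
    """Moment-based skewness: m3 / m2**1.5, using population moments."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

# Hypothetical data with a long right tail, and its mirror image.
right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]
left_tailed = [-x for x in right_tailed]

skew_right = skewness(right_tailed)
skew_left = skewness(left_tailed)

print(f"right-tailed: {skew_right:+.2f}")  # positive value
print(f"left-tailed:  {skew_left:+.2f}")   # negative value
```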
Kurtosis
Characterized by a vertical stretching or flattening of the frequency distribution
Leptokurtic: Distribution with heavy tails.
Platykurtic: Distribution with light tails.
Mesokurtic: Distribution with moderate tails, similar to a normal distribution.
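A rough numerical check of the three labels, assuming the moment-based excess kurtosis (fourth central moment divided by the squared second central moment, minus 3, so that a normal distribution scores about 0). The two data sets are made up to show the extremes.

```python
def excess_kurtosis(data):
    """Moment-based excess kurtosis: m4 / m2**2 - 3, using population moments."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return m4 / m2 ** 2 - 3

# Hypothetical flat data: values spread evenly, no pronounced tails.
flat = list(range(1, 11))

# Hypothetical heavy-tailed data: most values at the center, two far out.
heavy_tailed = [0] * 18 + [-10, 10]

k_flat = excess_kurtosis(flat)
k_heavy = excess_kurtosis(heavy_tailed)

print(f"flat:         {k_flat:+.2f}")   # negative: platykurtic
print(f"heavy-tailed: {k_heavy:+.2f}")  # positive: leptokurtic
```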
Graphs provide a visual way to understand the distribution and variation in the data.
Histogram: A bar graph that shows the frequency of data points within specified ranges (bins).
Box Plot (Box-and-Whisker Plot): Displays the median, quartiles, and potential outliers. It helps visualize the spread and skewness of the data.
Dot Plot: Shows individual data points and their frequency.
Stem-and-Leaf Plot: Similar to a histogram but retains the original data values.
Density Plot: A smoothed version of the histogram, often used to estimate the probability density function of the data.
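To show how a histogram groups values into bins, here is a minimal text-only sketch. The ages and the bin width of 10 are hypothetical; a plotting library would normally draw the bars.

```python
from collections import Counter

# Hypothetical sample of patient ages.
ages = [23, 25, 31, 34, 35, 38, 41, 42, 47, 52, 58, 63]
BIN_WIDTH = 10

# Assign each age to the lower edge of its bin (20, 30, 40, ...), then count.
bins = Counter((age // BIN_WIDTH) * BIN_WIDTH for age in ages)

# One row per bin, one '#' per observation in that bin.
for start in sorted(bins):
    print(f"{start}-{start + BIN_WIDTH - 1}: {'#' * bins[start]}")
```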
Combining various descriptive statistics provides a comprehensive overview of the data.
Five-Number Summary: Consists of the minimum, Q1, median, Q3, and maximum.
Summary Table: Includes mean, median, mode, range, variance, standard deviation, and other relevant statistics.
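A sketch of the five-number summary using `statistics.quantiles`. Quartile conventions differ between textbooks and software; the "inclusive" method is assumed here, and the helper name `five_number_summary` and the data are hypothetical.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # n=4 asks for the three cut points that divide the data into quarters.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)

# Hypothetical sample (order does not matter; quantiles sorts internally).
values = [7, 1, 9, 3, 5, 2, 8, 4, 6]

summary = five_number_summary(values)
print(summary)
```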
Outliers are data points that significantly differ from the rest of the dataset.
Detection: Using methods such as the IQR (1.5*IQR rule) or Z-scores.
Impact: Outliers can skew the results and give a misleading picture of the data distribution.
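Both detection methods can be sketched on a small hypothetical sample with one extreme reading. The assumptions are: the "inclusive" quartile method, the sample standard deviation for Z-scores, and a Z-score cutoff of 2 (cutoffs of 2, 2.5, or 3 are all in common use; the choice is a judgment call).

```python
import statistics

def iqr_outliers(data, k=1.5):
    """Values more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]

def z_score_outliers(data, cutoff):
    """Values whose absolute Z-score exceeds the cutoff."""
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)
    return [x for x in data if abs(x - mean) / sd > cutoff]

# Hypothetical readings with one suspicious value.
readings = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]

by_iqr = iqr_outliers(readings)
by_z = z_score_outliers(readings, cutoff=2)

print("IQR rule:", by_iqr)
print("Z-score:", by_z)

# Impact: the single outlier pulls the mean far above the median.
print("mean:", statistics.fmean(readings))
print("median:", statistics.median(readings))
```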
Comparing different datasets involves looking at their central tendency, spread, and shape.
Side-by-Side Box Plots: Useful for comparing the spread and central tendency of multiple groups.
Multiple Histograms: Placing histograms side by side or overlaying them for comparison.
Summary Statistics Comparison: Comparing means, medians, ranges, and standard deviations.
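A short sketch of a summary statistics comparison. The two groups are hypothetical and were chosen to share the same mean and median while differing in spread; `describe` is an illustrative helper, not a library function.

```python
import statistics

def describe(data):
    """Collect a few summary statistics for one group."""
    return {
        "mean": statistics.fmean(data),
        "median": statistics.median(data),
        "range": max(data) - min(data),
        "sd": statistics.stdev(data),
    }

# Hypothetical groups: same center, different spread.
group_a = [4, 5, 5, 5, 6]
group_b = [1, 3, 5, 7, 9]

summary_a = describe(group_a)
summary_b = describe(group_b)

# Print the statistics side by side.
for name in summary_a:
    print(f"{name:>6}: A = {summary_a[name]:.2f}   B = {summary_b[name]:.2f}")
```

Comparing only the means or medians would suggest the groups are alike. The range and standard deviation show that group B is far more spread out.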