Describing Variation & Distribution of Data

Presence or absence of disease and extent of disease
- Ex. Cancer of the cervix may be in situ, localized, invasive, or metastatic

Different conditions of measurement
- Often account for the variations observed in medical data
- Factors: time of the day, ambient temperature or noise, and the presence of fatigue or anxiety in the patient

Systematic Error
- Can distort data systematically in one direction.
- Can introduce bias

Nominal Variables
- Naming or categoric variables that are not based on measurement scales or rank order.
- Ex. Blood groups, occupations, skin color

Dichotomous (Binary) Variables
- Variables with only two levels
- Ex. Study of heart murmurs (systolic or diastolic)

Ordinal (Ranked) Variables
- Data that can be characterized in terms of three or more qualitative values
- Ex. Satisfaction of care

Continous (Dimensional) Variables
- Continous scales
- Observation differs over time
- Ex. Height, Weight, Blood pressure

Ratio Variables
- If a continous scale has true 0 point
- Ex. Kelvin Temperature

Frequency Distributions of Continuous Variable
- Can be shown by creating a table that lists the values of the variable according to the frequency with which the value occurs.

Range of a variable
- Range is the distance between the lowest and highest observations of the variable.

Real and Theoretical Frequency Distributions
- Real Frequency Distributions
  - Obtained from actual data or sample
- Theoretical Frequency Distributions
  - Calculated using assumptions about the population from which the sample was obtained
- Normal Distribution
  - Also called the Gaussian distribution after Johan Karl Gauss
  - Bell-shaped curve
Parameters of a Frequency Distribution
- Measures of Central Tendency
  - Mean (x̄) – Average value
  - Median – Middlemost or halfway value
  - Mode – Most frequent value
- Measures of Dispersion
  - Based on Percentiles
    - Percentile of Distribution
      - A point which at which a certain percentage of the observations lie below the indicated point when all the observations are ranked in descending order.
  - Based on Mean
    - Mean Absolute Deviation
      - Seldom used, but helps define the concept of dispersion
      - Does not have mathematical properties (as based form many statistical tests)
    - Variance
      - Fundamental measure of dispersion
    - Standard Deviation
      - Square root of the variance
      - Used to describe the amount of spread in the frequency distribution
      - Average of deviations from the mean

Skewness

A horizontal stretching of a frequency distribution to one side or the other, so that one tail of observations is longer and has more observations than the other tail

Skewed to the left
- When histogram or a frequency polygon has a longer tail on the left side of the diagram
- Negatively skewed distribution

Skewed to the right
- When histogram or a frequency polygon has a longer tail on the right side of the diagram
- Positvely skewed distribution

Kurtosis
- Characterized by a vertical stretching or flattening of the frequency distribution
  - Leptokurtic: Distribution with heavy tails.
  - Platykurtic: Distribution with light tails.
  - Mesokurtic: Distribution with moderate tails, similar to a normal distribution.

Graphs provide a visual way to understand the distribution and variation in the data.

Histogram: A bar graph that shows the frequency of data points within specified ranges (bins).
Box Plot (Box-and-Whisker Plot): Displays the median, quartiles, and potential outliers. It helps visualize the spread and skewness of the data.
Dot Plot: Shows individual data points and their frequency.
Stem-and-Leaf Plot: Similar to a histogram but retains the original data values.
Density Plot: A smoothed version of the histogram, often used to estimate the probability density function of the data.

Combining various descriptive statistics provides a comprehensive overview of the data.

Five-Number Summary: Consists of the minimum, Q1, median, Q3, and maximum.
Summary Table: Includes mean, median, mode, range, variance, standard deviation, and other relevant statistics.

Outliers are data points that significantly differ from the rest of the dataset.

Detection: Using methods such as the IQR (1.5*IQR rule) or Z-scores.
Impact: Outliers can skew the results and give a misleading picture of the data distribution.

Comparing different datasets involves looking at their central tendency, spread, and shape.

Side-by-Side Box Plots: Useful for comparing the spread and central tendency of multiple groups.
Multiple Histograms: Placing histograms side by side or overlaying them for comparison.
Summary Statistics Comparison: Comparing means, medians, ranges, and standard deviations.