Descriptive Statistics and Measures of Central Tendency

Course: Descriptive Statistics
Institution: Western University Canada
Week: 3 of the course
Focus Areas:
- Measures of central tendency
- Standard deviation
- Visualizing central tendency and range
- Distributions

Mode:
- Definition: The mode is the most frequently occurring value in a dataset.
- Types:
  - Bimodal: Dataset with two modes.
  - Multimodal: Dataset with more than two modes.
Median:
- Definition: The median is the middle value in a sorted list of numbers, dividing the dataset into two equal halves. With an even number of values, it is the average of the two middle values.
Mean (Average):
- Definition: The mean is calculated by summing all values in the dataset and dividing by the total number of observations.
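
As a small illustration, the three measures can be computed with Python's standard `statistics` module. This is only a sketch: the dataset is borrowed from the range example in these notes, and the variable names are illustrative.

```python
import statistics

# Example values (the same ones used in the range example in these notes).
values = [2, 2, 3, 5, 5, 7, 8]

modes = statistics.multimode(values)   # every most-frequent value; 2 and 5 tie here
median = statistics.median(values)     # middle value of the sorted list
mean = statistics.mean(values)         # sum of values / number of observations

print(modes, median, round(mean, 2))
```

Because 2 and 5 each occur twice, this dataset is bimodal, which is why `multimode` is used rather than `mode`.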

Linearity and Correlation Coefficient (r):
- Ranges from +1 to -1, indicating the strength and direction of a linear relationship between two variables.
Values Interpretation:
- +1: Perfect positive correlation
- 0: No linear correlation
- -1: Perfect negative correlation
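
One way to make these values concrete is to compute r by hand. The sketch below assumes the usual Pearson formula (sum of products of deviations divided by the square root of the product of the summed squared deviations); the function name `pearson_r` and the sample data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))   # y rises with x along a straight line
print(pearson_r(x, [10, 8, 6, 4, 2]))   # y falls with x along a straight line
print(pearson_r(x, [4, 1, 0, 1, 4]))    # U-shaped: related, but not linearly
```

The last call is a reminder that r only measures linear association: the U-shaped data are clearly related to x, yet the deviations cancel out.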

Definition: Central tendency indicates the center or typical value of a dataset, revealing where data points tend to cluster.
Dependence on Distribution: The measure of central tendency varies based on the nature of the data distribution.
Bell Curve (Normal Distribution):
- Characteristics:
  - Symmetrical distribution where the mean, median, and mode are equal and located at the midpoint.
  - Approximately 68% of values fall within one standard deviation of the mean, and about 95% fall within two standard deviations.
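
These percentages can be checked numerically. The following sketch uses `statistics.NormalDist` from the Python standard library (3.8+); the helper name `share_within` is an assumption made for illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 1))
```

The same proportions hold for any normal distribution, whatever its mean and standard deviation, because the bands are expressed in standard-deviation units.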

Definition: Dispersion describes how data varies from the central tendency.
Key Terms:
- Range:
  - Definition: The range is the difference between the highest and lowest values in the dataset.
  - Calculation:
    $\text{Range} = \text{Highest value} - \text{Lowest value}$
  - Example Calculation: Given values 2, 2, 3, 5, 5, 7, 8:
    - Highest = 8
    - Lowest = 2
    - Calculation:
      $8 - 2 = 6$
    - Thus, the range is 6.
- Standard Deviation:
  - Definition: The standard deviation measures the average distance of each value from the mean, indicating how spread out the values are in the dataset.
  - Characteristics:
    - A larger standard deviation indicates greater variability in the data.
  - Calculation Steps:
    1. Calculate the mean.
    2. Find the average of the squared differences from the mean.
    3. Take the square root of this average.
  - Note on Sample vs Population:
    - When calculating standard deviation, using $n-1$ instead of $n$ corrects for underestimation of the population standard deviation by producing a slightly larger result.
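
The range and the three standard deviation steps can be sketched in Python as follows. The function name `standard_deviation` and its `sample` flag are hypothetical; the data are the values from the range example.

```python
import math

def standard_deviation(values, sample=False):
    """Standard deviation following the three steps in the notes.

    sample=False divides by n (population); sample=True divides by n - 1.
    """
    n = len(values)
    if n < (2 if sample else 1):
        raise ValueError("not enough values")
    mean = sum(values) / n                               # step 1: the mean
    squared_diffs = [(v - mean) ** 2 for v in values]
    divisor = n - 1 if sample else n
    variance = sum(squared_diffs) / divisor              # step 2: average squared difference
    return math.sqrt(variance)                           # step 3: square root

values = [2, 2, 3, 5, 5, 7, 8]
value_range = max(values) - min(values)                  # 8 - 2 = 6
print(value_range)
print(standard_deviation(values))                # divides by n
print(standard_deviation(values, sample=True))   # divides by n - 1, slightly larger
```

Dividing by a smaller number ($n-1$) always gives the larger of the two results, which is the correction the note above describes.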

Heights of dogs at the shoulder (in mm): 600, 470, 170, 430, and 300.
Step 1: Calculate the Mean
- Mean Height Calculation:
  - Total number of dogs (n) = 5.
  - Calculation: $\frac{600 + 470 + 170 + 430 + 300}{5} = \frac{1970}{5} = 394$ mm.

Calculating each dog's difference from the mean:
- Height data: 600, 470, 170, 430, 300
- Differences calculation:
  - Individual results: 206, 76, -224, 36, -94.
  - Total number of dogs: 5.
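
Continuing from these differences, the remaining steps can be worked through as follows. This is a sketch rather than part of the original notes: the rounding is an assumption, and both the population form (dividing by $n$) and the sample form (dividing by $n-1$) are shown because the notes mention both.
- Squared differences:
  $206^2 = 42436,\ 76^2 = 5776,\ (-224)^2 = 50176,\ 36^2 = 1296,\ (-94)^2 = 8836$
- Sum of squared differences:
  $42436 + 5776 + 50176 + 1296 + 8836 = 108520$
- Population standard deviation (divide by $n = 5$):
  $\sigma = \sqrt{108520 / 5} = \sqrt{21704} \approx 147.3$ mm
- Sample standard deviation (divide by $n - 1 = 4$):
  $SD = \sqrt{108520 / 4} = \sqrt{27130} \approx 164.7$ mm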

Standard deviation illustrates the typical variation from the average height of dogs.
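It also shows which heights count as typical by identifying the values within one standard deviation of the mean: using the sample form ($n-1$), that is 394 ± 164.7 mm, or roughly 229 mm to 559 mm, which includes the 300, 430, and 470 mm dogs but not the 170 and 600 mm dogs.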

Importance of Normal Distribution:
- Many variables exhibit normal distribution patterns.
- Normality assumptions are critical for inferential statistics and hypothesis testing.
Characteristics:
- For a normally distributed variable, approx. 68.3% of data will be within 1 SD of the mean, 95.4% within 2 SD, and 99.7% within 3 SD.
Key Symbols:
- $\mu$ = mean (population)
- $\bar{x}$ or $M$ = mean (sample)
- $\sigma$ = standard deviation (population)
- $SD$ = standard deviation (sample)

Statistical Tests:
- Z-scores can compare observed vs. expected values.
- Confidence intervals estimate a range of plausible values for the population mean.
- Hypothesis testing often relies on normal distribution assumptions.
Z-Scores:
- Definition: A Z-score measures how many standard deviations a value is from the mean: $z = \frac{x - \mu}{\sigma}$.
- Usage: Z-scores are standardized, which makes values from different datasets directly comparable.
- Interpretation:
  - Z-score of 0 indicates the mean.
  - Positive Z-scores indicate a value above the mean.
  - Negative Z-scores indicate a value below the mean.
  - For normally distributed data, Z-scores map onto standard deviation bands: about 68% of values lie within ±1 SD, about 95% within ±2 SD, and about 99.7% within ±3 SD.
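
As an illustration, Z-scores for the dog heights used earlier in these notes can be computed like this. The sketch assumes the sample standard deviation (dividing by $n-1$), which is what `statistics.stdev` returns; the helper name `z_score` is hypothetical.

```python
import statistics

heights_mm = [600, 470, 170, 430, 300]   # dog heights from the example in these notes

mean = statistics.mean(heights_mm)        # 394
sd = statistics.stdev(heights_mm)         # sample SD (divides by n - 1)

def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

for h in heights_mm:
    print(h, round(z_score(h, mean, sd), 2))
```

Heights above 394 mm get positive Z-scores and heights below it get negative ones. With only five dogs the percentage bands above should not be expected to hold closely.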

Considerations: Data are not always normally distributed.
Examples:
- Skewness:
  - Definition: An asymmetrical distribution where tails differ in length.
  - Types:
    - Positive skew (longer right tail).
    - Negative skew (longer left tail).
- Kurtosis:
  - Indicates the peakedness or flatness of a distribution, in particular how heavy its tails are compared with a normal distribution.
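
Both quantities can be estimated from the central moments of a dataset. The sketch below assumes the simple moment-based (population) formulas with no small-sample bias correction; the function name and the two example datasets are hypothetical.

```python
def moment_skew_kurtosis(values):
    """Moment-based skewness and excess kurtosis (population formulas, no bias correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment (variance)
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n   # fourth central moment
    if m2 == 0:
        raise ValueError("undefined for constant data")
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3              # 0 for a normal distribution
    return skewness, excess_kurtosis

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]   # one large value stretches the right tail

print(moment_skew_kurtosis(symmetric))
print(moment_skew_kurtosis(right_tailed))
```

A skewness near 0 suggests symmetry, a positive value a longer right tail, and a negative value a longer left tail. Statistical packages often apply bias corrections, so their results may differ slightly from this sketch.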

Probability: Reflects the chance of a specific outcome occurring.
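Probability Value (P-value): The probability of obtaining results at least as extreme as those observed if chance alone were at work (that is, if the null hypothesis were true).
- Interpretation of P-values:
  - Small P-values: The observed results would be unlikely under chance alone, suggesting a potentially meaningful effect.
  - Large P-values: The observed results are consistent with chance, so the data do not provide evidence of a significant effect.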
Probability Distribution: A description, often shown as a graph, of the probability of each possible outcome rather than its observed frequency.
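
To show how a P-value is read off a probability distribution, here is a minimal sketch. It assumes a test statistic that follows a standard normal distribution when only chance is at work (a z-test), which these notes do not specify; the function name is hypothetical.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """Two-sided P-value for a z statistic, assuming a standard normal null distribution."""
    # Probability of a value at least as far from 0 as |z|, counting both tails.
    return 2 * (1 - NormalDist().cdf(abs(z)))

for z in (0.5, 1.96, 3.0):
    print(z, round(two_sided_p_value(z), 4))
```

The further the statistic lies from 0, the smaller the P-value. A z of about 1.96 corresponds to the conventional 0.05 threshold.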

Variability in sampling can yield different results from the same population.
Measurement errors can arise from inaccuracies in tools, methods, or human factors.
Model assumptions may distort analysis if they don't reflect actual events.
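
Sampling variability in particular is easy to see in a small simulation. The sketch below is purely illustrative: the population is simulated (10,000 values from a normal distribution with an arbitrary mean of 400 and SD of 150), and the seed is fixed only so the run is repeatable.

```python
import random
import statistics

rng = random.Random(42)   # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated values, not real measurements.
population = [rng.gauss(400, 150) for _ in range(10_000)]

# Draw several samples of 5 from the same population and compare their means.
sample_means = [statistics.mean(rng.sample(population, 5)) for _ in range(4)]

print(round(statistics.mean(population), 1))
print([round(m, 1) for m in sample_means])
```

Each sample comes from the same population, yet the sample means differ from one another and from the population mean. Larger samples would typically vary less.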

Range: Measures dispersion within a dataset.
Standard Deviation: Average distance from the mean.
Normality: Important for statistical analysis but may not always be present (consider skewness/kurtosis).
Uncertainty: An inherent aspect of data quantifiable through probability.