Statistics 1A: Describing Data with Numerical Measures

Graphical methods may not always be sufficient for describing data.
Numerical measures can be created for both populations and samples.
- Parameter: A numerical descriptive measure calculated for a population.
- Statistic: A numerical descriptive measure calculated for a sample.

A central location statistic is a single number that indicates where the data values in a sample are concentrated.

The mean, or arithmetic average, is a frequently used measure of the center of a set of numbers; when calculated from a sample it is called the sample mean.
Definition: The sample mean of observations $x_1, x_2, \ldots, x_n$ is given by: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Advantages:
- Uses all data values.
Disadvantages:
- Only valid for numeric variables.
- Distorted by outliers.
Humorous Illustration: "My girlfriend dropped me – she said I am AVERAGE. But I think she was just being MEAN!"

The median is the middle value of the ordered data. Unlike the mean, it is not unduly influenced by outliers, but it is likewise only appropriate for numeric data.
Calculating Mean and Median for different heart rates:
- 19-year-old patients: 108, 68, 80, 83, 72.
- 55-year-old patients: 86, 86, 92, 100, 112, 116, 136, 140.
The median can be calculated as follows, after sorting the $n$ observations from smallest to largest:
- If $n$ is odd, the median is the middle value of the sorted data.
- If $n$ is even, the median is the average of the two middle values of the sorted data.
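
As a worked check on the heart-rate data above, the following Python sketch computes both measures. The helper names `sample_mean` and `sample_median` are illustrative choices, not part of the notes.

```python
def sample_mean(values):
    """Arithmetic average: sum of the observations divided by n."""
    return sum(values) / len(values)


def sample_median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


age_19 = [108, 68, 80, 83, 72]
age_55 = [86, 86, 92, 100, 112, 116, 136, 140]

print(sample_mean(age_19), sample_median(age_19))  # 82.2 80
print(sample_mean(age_55), sample_median(age_55))  # 108.5 106.0
```

In the 19-year-old group the single high reading (108) pulls the mean above the median, which illustrates the mean's sensitivity to outliers.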

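Generally, the population mean and median will not be identical.
Skewness: If the population distribution is positively or negatively skewed, then:
mean ≠ median
Typically the mean is pulled toward the longer tail, so it lies above the median under positive skew and below it under negative skew.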
When making inferences about a population, an important consideration is deciding which characteristic (mean or median) is more relevant to the question at hand.

The median divides the data set into two equal parts.
Quartiles: Divide the data set into four equal parts:
- First quartile ($Q_1$): 25th percentile
- Second quartile ($Q_2$): Median (50th percentile)
- Third quartile ($Q_3$): 75th percentile
Percentiles: For finer measures, percentiles divide the data into 100 parts. E.g., the 99th percentile separates the highest 1% from the bottom 99%.
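
Software packages use several slightly different interpolation rules for quartiles and percentiles. The sketch below uses Python's standard-library `statistics.quantiles` on the 55-year-old heart rates; the choice of `method="inclusive"` is an assumption made for illustration, and the values it returns for $Q_1$ and $Q_3$ can differ from those of a median-of-halves rule.

```python
import statistics

age_55 = [86, 86, 92, 100, 112, 116, 136, 140]

# n=4 asks for the three cut points that split the data into four parts.
# method="inclusive" is an illustrative choice, not a rule from the notes.
q1, q2, q3 = statistics.quantiles(age_55, n=4, method="inclusive")

print(q2 == statistics.median(age_55))  # True: the second quartile is the median
print(q1 <= q2 <= q3)                   # True: the quartiles are ordered
```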

The trimmed mean excludes the $k$ smallest and $k$ largest order statistics and averages the remaining $n - 2k$ values, which reduces the impact of outliers.
Robustness: Trimmed means are not unduly affected by extreme values.
Example: Judges' scores in sports where extreme scores are discarded before calculation.
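
A minimal sketch of a trimmed mean that drops the $k$ smallest and $k$ largest values. The function name `trimmed_mean` and the judges' scores are hypothetical, chosen only to illustrate the idea.

```python
def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest observations."""
    n = len(values)
    if k < 0 or 2 * k >= n:
        raise ValueError("k must satisfy 0 <= k < n/2")
    ordered = sorted(values)
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)


# Hypothetical judges' scores (illustrative values, not from the notes).
scores = [9.5, 8.0, 8.5, 8.5, 9.0, 4.0, 8.5]

print(trimmed_mean(scores, 0))  # 8.0, the ordinary mean, pulled down by the 4.0
print(trimmed_mean(scores, 1))  # 8.5, after discarding the 4.0 and the 9.5
```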

Reporting a measure of center (mean or median) gives partial information about data sets.
Samples can have the same central measures but different spreads.
- Visual Representation: Dot plots may show varying extents of spread even with identical means and medians.

Measures of spread:
- Variance
- Standard Deviation
- Interquartile Range (IQR)
- Range
- Quartile Deviation

The range is the difference between the largest and smallest sample values:
$R = x_{\max} - x_{\min}$
The range is adequate for small data sets but not comprehensive, because it depends only on the two extreme values.

The Interquartile Range (IQR) is defined as:
$\mathrm{IQR} = Q_3 - Q_1$
Where $Q_3$ is the median of the upper half and $Q_1$ is the median of the lower half of the data set.
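
The range and the IQR can be sketched in Python as follows, using the 55-year-old heart rates from earlier in these notes. The helper names are illustrative. One assumption is needed for odd $n$, which the definition above leaves open: here the middle value is included in both halves. Other texts exclude it, which can change the result.

```python
import statistics


def sample_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


def iqr_by_halves(values):
    """IQR with Q1 and Q3 taken as the medians of the lower and upper halves.

    Assumption: when n is odd, the middle value belongs to both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[n - half:])
    return q3 - q1


age_55 = [86, 86, 92, 100, 112, 116, 136, 140]

print(sample_range(age_55))   # 54
print(iqr_by_halves(age_55))  # 37.0  (Q1 = 89, Q3 = 126)
```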

Population Variance ($\sigma^2$) and Sample Variance ($s^2$):
- Population variance formula:
  $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
- Sample variance formula:
  $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
Standard Deviation is the square root of variance.
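
The two formulas differ only in the divisor and in which mean is subtracted. A short sketch on the 19-year-old heart rates follows; treating those five values as a whole population in the second function is done purely for comparison, and the function names are illustrative.

```python
import math
import statistics

age_19 = [108, 68, 80, 83, 72]


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


s2 = sample_variance(age_19)
s = math.sqrt(s2)  # standard deviation is the square root of the variance

print(round(s2, 1), round(s, 2))               # 244.2 15.63
print(round(population_variance(age_19), 2))   # 195.36

# Cross-check against the standard library.
print(math.isclose(s2, statistics.variance(age_19)))  # True
```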

If $y = cx + d$, where $c$ and $d$ are constants:
- Sample Variance of $y$:
  $s_y^2 = c^2 s_x^2$
  The shift $d$ does not affect the spread, so it does not appear in the result.
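
The scaling rule can be checked numerically. In this sketch the constants `c = 2` and `d = 5` are arbitrary values chosen for illustration, applied to the 19-year-old heart rates.

```python
import math
import statistics

x = [108, 68, 80, 83, 72]
c, d = 2, 5  # arbitrary constants chosen for the illustration

y = [c * xi + d for xi in x]

var_x = statistics.variance(x)  # sample variance (divisor n - 1)
var_y = statistics.variance(y)

# The variance scales by c squared; the shift d has no effect.
print(math.isclose(var_y, c ** 2 * var_x))  # True
```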

A boxplot is based on measures that remain stable in the presence of a few outliers, specifically the median and a measure of spread known as the fourth spread.
Definitions for Boxplots:
- Lower Fourth: Median of the smallest half.
- Upper Fourth: Median of the largest half.
- Fourth Spread ($f_s$):
  $f_s = \text{upper fourth} - \text{lower fourth}$

Outlier Definition: Any observation farther than 1.5 times the fourth spread ($1.5f_s$) from the closest fourth is considered an outlier.
- Extreme Outlier: More than $3f_s$ from the nearest fourth.
- Mild Outlier: Between $1.5f_s$ and $3f_s$ from the nearest fourth.
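
A minimal sketch of the outlier rule: values more than $1.5f_s$ beyond the nearest fourth are flagged, and those more than $3f_s$ beyond it are flagged as extreme. The function names are illustrative. The code assumes that, when $n$ is odd, the median is included in both halves when finding the fourths; the notes do not state this, and a different convention can change which points are flagged.

```python
import statistics


def fourths(values):
    """Lower and upper fourths: medians of the smallest and largest halves.

    Assumption: for odd n, the middle value is included in both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2
    lower = statistics.median(ordered[:half])
    upper = statistics.median(ordered[n - half:])
    return lower, upper


def classify_outliers(values):
    """Return (mild, extreme) outliers according to the fourth spread."""
    lower, upper = fourths(values)
    fs = upper - lower
    mild, extreme = [], []
    for x in values:
        # Distance beyond the nearest fourth (0 if x lies inside the box).
        distance = max(lower - x, x - upper, 0)
        if distance > 3 * fs:
            extreme.append(x)
        elif distance > 1.5 * fs:
            mild.append(x)
    return mild, extreme


age_19 = [108, 68, 80, 83, 72]

print(fourths(age_19))            # (72, 83), so the fourth spread is 11
print(classify_outliers(age_19))  # ([108], [])
```

Under this convention the reading of 108 lies 25 above the upper fourth, which is more than $1.5f_s = 16.5$ but less than $3f_s = 33$, so it is a mild outlier.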

Characteristics of distributions based on boxplot structure:
- Symmetric Distribution: Median line in center of box and whiskers of equal length.
- Skewed Right: Median line left of center and long right whisker.
- Skewed Left: Median line right of center and long left whisker.

Comparative (side-by-side) boxplots can be used effectively to reveal similarities and differences between two or more data sets concerning the same variable.
- Example: Box plots of marks from 450 students across three classes can illustrate their comparative performance.