Unit 2: Summarizing & Describing Data

Bar Graph
- Horizontal Axis: Labelled with the value of the variable.
- Vertical Axis: Labelled with frequency (either raw number or percentage).
- Height of Bars: Represents the frequency of each category.
- Bar Characteristics:
- Bars do not need to touch each other.
- The values of the variable do not need to appear in any given order.
- Each unit can give a categorical, binary, or numerical answer.
Pie Chart
- Full Circle: Represents a total of 100%.
- Respondents may only provide one answer.
- Slices: Correspond to the frequency of each category expressed as a percentage.
- Sizes of the Slices: Proportional to the frequency.
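As a minimal sketch of how category frequencies turn into bar heights and pie slices, the snippet below counts made-up survey answers and converts them to percentages and slice angles. The data and variable names are assumptions for illustration only.

```python
from collections import Counter

# Made-up answers: each respondent gives exactly one category.
answers = ["bus", "car", "car", "walk", "car", "bus", "bike", "car"]

counts = Counter(answers)        # raw frequency of each category (bar heights)
total = sum(counts.values())

# Percentage of the whole for each category (slice labels).
percentages = {cat: 100 * n / total for cat, n in counts.items()}
# Slice angle in degrees, proportional to frequency (full circle = 360).
angles = {cat: 360 * n / total for cat, n in counts.items()}

for cat in counts:
    print(cat, counts[cat], percentages[cat], angles[cat])
```

Because every respondent contributes one answer, the percentages add up to 100 and the angles to 360.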

Key graph types for quantitative data include stemplots, histograms, and cumulative frequency ogives.

Stemplot
Purpose:
- Shows the shape of the data and the actual values in the dataset.
How to Create:
1. Sort the data in ascending or descending order.
2. Separate each number into the stem (all but the final digit) and the leaf (the final digit).
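The two steps above can be sketched in a few lines of Python. This is an illustrative helper (the name `stemplot` and the data are made up), and it assumes non-negative whole numbers so that the final digit is the leaf.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot rows such as '1 | 2 5' (assumes non-negative integers)."""
    rows = defaultdict(list)
    for v in sorted(values):            # step 1: sort in ascending order
        stem, leaf = divmod(v, 10)      # step 2: stem = all but final digit, leaf = final digit
        rows[stem].append(leaf)
    lines = []
    # List every stem between the smallest and largest so gaps stay visible.
    for stem in range(min(rows), max(rows) + 1):
        leaves = " ".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

scores = [23, 12, 47, 15, 30, 21, 23]   # made-up data
stem_rows = stemplot(scores)
print("\n".join(stem_rows))
```

Each row keeps the actual values readable: the row `2 | 1 3 3` stands for 21, 23, and 23.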

Example Format:

Histogram
Horizontal Axis: Continuous range of values for the variable, divided into bins.
Vertical Axis: Represents the frequency (number or percentage) of observations in each bin.
Vertical Bars: Each bin is drawn as one bar whose height reflects its frequency; adjacent bars touch.
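To make the idea of bins concrete, here is a small sketch that counts how many values fall into equal-width bins and prints a text histogram. The function name, the bin settings, and the data are assumptions chosen for illustration.

```python
def bin_counts(values, start, width, n_bins):
    """Count values in n_bins equal-width bins [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // width)   # index of the bin containing v
        if 0 <= k < n_bins:             # values outside the covered range are ignored
            counts[k] += 1
    return counts

heights = [151, 155, 158, 161, 163, 164, 166, 172, 179]   # made-up data
hist_counts = bin_counts(heights, start=150, width=10, n_bins=3)

for k, c in enumerate(hist_counts):
    lo = 150 + 10 * k
    print(f"[{lo}, {lo + 10}): {'#' * c}")
```

Each bin includes its lower edge and excludes its upper edge, so every value lands in exactly one bin and the bars cover the axis without gaps.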

Key Differences:

Histograms vs. Bar Graphs
- Histogram:
- Variable graphed is quantitative.
- The bars touch due to continuous x-axis with ordered values.
- Bar Graph:
- Variable graphed is categorical.
- Bars do not touch and x-values can be in any order.

Histograms vs. Stemplots

Both display quantitative data but suit different dataset sizes: stemplots work best for small datasets, where every value can be listed, while histograms are used for larger datasets.

A cumulative frequency ogive is a type of line graph.
Two types of cumulative frequency ogives exist:
- Less than cumulative frequency ogives: Determine how many responses or what percentage are less than a specific value.
- More than cumulative frequency ogives: Determine how many responses or what percentage are greater than a specific value.
They are constructed from frequency distributions with:
- Horizontal Axis: Continuous range of values for the variable.
- Vertical Axis: Frequency corresponding to less than or more than a specific value.
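As a rough sketch, the points of both ogive types can be derived from a grouped frequency distribution as follows. The class boundaries and frequencies are made up for illustration.

```python
from itertools import accumulate

boundaries = [0, 10, 20, 30, 40]   # class boundaries (made-up)
freqs = [4, 7, 6, 3]               # frequency of each class between consecutive boundaries

total = sum(freqs)
# Less than ogive: number of observations below each boundary.
less_than = [0] + list(accumulate(freqs))
# More than ogive: number of observations at or above each boundary.
more_than = [total - c for c in less_than]

less_than_points = list(zip(boundaries, less_than))
more_than_points = list(zip(boundaries, more_than))
print(less_than_points)
print(more_than_points)
```

Plotting each list of points and joining them with straight lines gives the two ogives; dividing the counts by `total` gives the percentage versions.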

Frequency Distribution: The frequency distribution of a variable provides the possible values it can take and how often these values occur, either by frequency or percentage.
Applicable for both ungrouped data and grouped data.

Class: The first column indicating the interval of values to include in that row.
Class Limits:
- Lower Class Limit: The lowest possible value in that class.
- Upper Class Limit: The highest possible value in that class.
Class Mark (aka Midpoint): The midpoint of each class, calculated as the average of its lower and upper class limits.
Class Boundaries: Values that separate each class; used to avoid gaps between adjacent classes.

The class width is the difference between the upper and lower class boundaries of a class (equivalently, the difference between the lower class limits of two consecutive classes).
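The definitions above can be checked with a short sketch. It assumes made-up classes with whole-number limits (10 to 19, 20 to 29, 30 to 39) and places each boundary halfway across the gap between adjacent classes.

```python
# Made-up class limits (lower, upper) for three adjacent classes.
limits = [(10, 19), (20, 29), (30, 39)]

# Gap between one class's upper limit and the next class's lower limit.
gap = limits[1][0] - limits[0][1]

class_table = []
for lower, upper in limits:
    mark = (lower + upper) / 2           # class mark: average of the class limits
    lower_boundary = lower - gap / 2     # boundaries sit halfway across the gap
    upper_boundary = upper + gap / 2
    width = upper_boundary - lower_boundary
    class_table.append((mark, lower_boundary, upper_boundary, width))

for row in class_table:
    print(row)
```

The upper boundary of one class equals the lower boundary of the next, which is what removes the gaps between adjacent classes.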

Relative Frequency: The frequency of each class as a proportion of the total frequency; the relative frequencies should sum to 1.
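A quick sketch of the calculation, using made-up class frequencies:

```python
import math

freqs = [4, 7, 6, 3]                      # made-up class frequencies
total = sum(freqs)

relative = [f / total for f in freqs]     # proportion of observations in each class
print(relative)

# Floating-point rounding can leave the sum a hair off 1, so compare with a tolerance.
print(math.isclose(sum(relative), 1.0))
```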

Cumulative Frequency: The number (or proportion) of observations that lie below or above a particular class boundary. Less than cumulative frequencies are plotted against the upper class boundaries.

Overall Examination Questions:
1. What is the overall shape of the distribution? (symmetric or skewed / number of peaks)
2. Are there any unusual points? (errors, outliers, or influential points)
3. Where is the center of the distribution? (mean, median, mode)
4. How much variation is there? (range, standard deviation, variance, IQR)

Number of Peaks:
- Unimodal: 1 peak
- Bimodal: 2 peaks
- Multimodal: More than 2 peaks
Skewness: Describes how asymmetric the distribution is.
- Symmetric: Tails are approximately equal in length; neither is longer.
- Right Skew (aka: positively skewed): Longer tail to the right.
- Left Skew (aka: negatively skewed): Longer tail to the left.

Mean (Arithmetic Mean):
- Formula:
  $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$, where $x_i$ is the $i$-th observed value and $n$ is the number of observations (the total frequency).
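For grouped data the individual values are not available, so a common approach (an assumption here, since the notes do not spell it out) is to let each class mark stand in for every observation in its class. A minimal sketch with made-up classes:

```python
class_marks = [5, 15, 25, 35]   # made-up class midpoints
freqs = [4, 7, 6, 3]            # frequency of each class

n = sum(freqs)                  # total frequency
# Each class mark is counted once per observation in its class.
grouped_mean = sum(f * x for f, x in zip(freqs, class_marks)) / n
print(grouped_mean)
```

The result is an approximation of the mean, because the values inside a class are all treated as equal to the class mark.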
Mode: The mode is the value that appears most frequently in the dataset.
- For a grouped frequency distribution, the modal class is the class with the highest frequency.
- The mode is then estimated within the modal class using a formula based on how much its frequency exceeds the frequencies of the two adjacent classes.
Median: Defined as the middle score of a distribution.
1. Sort the data in ascending order.
2. For an odd number of observations, the median is the value at position
  $\frac{n+1}{2}$ in the sorted data.
3. For an even number of observations, the median is the average of the two middle values.
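The three steps translate directly into code. This is a small illustrative implementation (Python's standard library also offers `statistics.median`):

```python
def median(values):
    """Median following the steps above: sort, then take the middle value(s)."""
    data = sorted(values)                 # step 1: ascending order
    n = len(data)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: index mid (0-based) is position (n + 1) / 2 in 1-based counting.
        return data[mid]
    # Even n: average the two middle values.
    return (data[mid - 1] + data[mid]) / 2

print(median([15, 20, -3, 2, 23, 35, 10]))
print(median([1, 2, 3, 4]))
```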

The mean is not always equal to the median; both measure the center, but they can lead to different interpretations when the distribution is skewed.
- Example:
- Data Set: 15, 20, -3, 2, 23, 35, 10.
- Sorted: -3, 2, 10, 15, 20, 23, 35.
- Median is 15; the mean is 102/7 ≈ 14.57. Unlike the median, the mean is pulled toward extreme values.
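To see how an extreme value affects the two measures, the sketch below compares the example dataset with a hypothetical variant in which 35 is replaced by 350 (the variant is an assumption added for illustration).

```python
from statistics import mean, median

data = [15, 20, -3, 2, 23, 35, 10]
with_extreme = [15, 20, -3, 2, 23, 350, 10]   # hypothetical: 35 replaced by 350

print(mean(data), median(data))
print(mean(with_extreme), median(with_extreme))
```

The median stays at 15 in both cases, while the mean moves from about 14.57 to about 59.57.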

Range: The difference between the maximum and minimum values in the dataset.
Variance: Quantifies the degree of variation in the dataset. Population variance and sample variance are defined as follows:
- Population Variance:
  $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
- Sample Variance:
  $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
  where $N$ is the population size, $n$ is the sample size, $\mu$ is the population mean, and $\bar{x}$ is the sample mean.
Standard Deviation: The square root of the variance; measures the spread of the data in the original units and roughly indicates the typical distance of data points from the mean.
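A minimal sketch of both variance formulas and the standard deviation, using made-up data. The standard library functions `statistics.pvariance` and `statistics.variance` compute the same quantities.

```python
import math

def variance(values, sample=True):
    """Sample variance (divide by n - 1) or population variance (divide by n)."""
    n = len(values)
    if n < (2 if sample else 1):
        raise ValueError("not enough data points")
    mean = sum(values) / n
    squared_devs = sum((x - mean) ** 2 for x in values)   # sum of squared deviations
    return squared_devs / (n - 1 if sample else n)

data = [2, 4, 4, 4, 5, 5, 7, 9]          # made-up data with mean 5

pop_var = variance(data, sample=False)
samp_var = variance(data, sample=True)
pop_sd = math.sqrt(pop_var)              # standard deviation = square root of variance
print(pop_var, pop_sd, samp_var)
```

Dividing by n - 1 makes the sample variance slightly larger than the population formula would give for the same numbers.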

The pth percentile is the value of the variable such that p% of the observations fall at or below it.
- 25th Percentile = First Quartile = Q1
- 50th Percentile = Median = M
- 75th Percentile = Third Quartile = Q3.
5-Number Summary: Includes the minimum, Q1, median, Q3, and maximum values; useful in describing distributions, particularly for non-symmetric datasets.
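Quartiles can be computed under several conventions, and software packages often interpolate differently. The sketch below assumes one common classroom convention: Q1 and Q3 are the medians of the lower and upper halves of the sorted data, leaving out the overall median when the number of observations is odd.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = data[:half]                  # values below the median position
    upper = data[half + (n % 2):]        # values above it (skip the median if n is odd)
    return (data[0], median(lower), median(data), median(upper), data[-1])

summary = five_number_summary([15, 20, -3, 2, 23, 35, 10])
print(summary)
```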

A boxplot visualizes the 5-number summary, highlighting the median and the interquartile range (IQR), which encapsulates the middle 50% of the data points.

Outliers are data points that are much larger or smaller than the other points in the dataset. A modified boxplot plots outliers as individual points and extends the whiskers only to the most extreme values that are not outliers.
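The notes do not say how extreme a value must be to count as an outlier. A widely used rule, assumed here, flags any value more than 1.5 × IQR below Q1 or above Q3. The sketch uses the median-of-halves convention for quartiles and made-up data.

```python
from statistics import median

def modified_boxplot_parts(values, k=1.5):
    """Fences, outliers, and whisker ends under the k * IQR rule (k = 1.5 by default)."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    q1 = median(data[:half])                 # median of the lower half
    q3 = median(data[half + (n % 2):])       # median of the upper half
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in data if v < low_fence or v > high_fence]
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "fences": (low_fence, high_fence),
        "outliers": outliers,                   # drawn as individual points
        "whiskers": (inside[0], inside[-1]),    # most extreme non-outlier values
    }

parts = modified_boxplot_parts([-3, 2, 10, 15, 20, 23, 35, 90])
print(parts)
```

In this example 90 lies above the upper fence, so it is plotted on its own and the upper whisker stops at 35.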

For skewed distributions with outliers, use the median and interquartile range for a robust summary.
For symmetric distributions without outliers, mean and standard deviation are effective.

When comparing two or more groups, consider using:
1. Side-by-side boxplot
2. Back-to-back stemplot.