Unit 2: Summarizing & Describing Data
2.1. Graphical Methods for Presenting Data
2.1.1. Graphical Methods for Representing Categorical Data
Bar Graph
- Horizontal Axis: Labelled with the value of the variable.
- Vertical Axis: Labelled with frequency (either raw number or percentage).
- Height of Bars: Represents the frequency of each category.
- Bar Characteristics:
- Bars do not need to touch each other.
- The values of the variable do not need to appear in any given order.
- Each unit can give a categorical, binary, or numerical answer.
Pie Chart
- Full Circle: Represents a total of 100%.
- Respondents may only provide one answer.
- Slices: Correspond to the frequency of each category expressed as a percentage.
- Sizes of the Slices: Proportional to the frequency.
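Bar heights and pie slices both come from the same frequency table. A minimal sketch of that table, using made-up survey responses (the data and variable names are illustrative assumptions, not from the notes):

```python
from collections import Counter

# Hypothetical survey responses (made-up data for illustration only).
responses = ["bus", "car", "bus", "walk", "car", "bus", "bike", "bus"]

counts = Counter(responses)        # raw frequency per category (bar heights)
total = sum(counts.values())       # number of respondents

# Percentage per category (pie slice sizes); these sum to 100%.
percentages = {category: 100 * count / total for category, count in counts.items()}

for category, count in counts.items():
    print(f"{category:<5} {count:>2} {percentages[category]:5.1f}%")
```

The raw counts would be used for a bar graph with a frequency axis, and the percentages for a pie chart or a percentage axis.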
2.1.2. Graphical Methods for Representing Quantitative Data
- Key types include stemplots, histograms, and cumulative frequency ogives.
Stemplots (aka: Stem and Leaf Plots)
Purpose:
- Shows the shape of the data and the actual values in the dataset.
How to Create:
- Sort the data in ascending or descending order.
- Separate each number into the stem (all but the final digit) and the leaf (the final digit).
Example Format:
- Value | Stem | Leaf
- 7 | 0 | 7
- 10 | 1 | 0
- 17 | 1 | 7
- 21 | 2 | 1
- 321 | 32 | 1
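The stem/leaf split described above can be sketched in a few lines. This is an illustrative helper (the function name `stemplot` is an assumption); it handles non-negative integers only and, for brevity, omits stems that have no leaves, which a hand-drawn stemplot would normally still list.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot rows as (stem, leaves) pairs for non-negative integers."""
    rows = defaultdict(list)
    for value in sorted(values):
        # Stem: all but the final digit. Leaf: the final digit.
        stem, leaf = divmod(value, 10)
        rows[stem].append(leaf)
    return sorted(rows.items())

data = [7, 10, 17, 21, 321]   # the values from the example above
for stem, leaves in stemplot(data):
    print(f"{stem:>3} | {' '.join(str(leaf) for leaf in leaves)}")
```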
Histograms
- Horizontal Axis: Continuous range of values for the variable.
- Vertical Axis: Represents the frequency (number or percentage) corresponding to different bins.
- Vertical Bars: Each bin is drawn as a bar whose height reflects the bin's frequency; adjacent bars touch.
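The counts behind a histogram can be sketched as follows. The scores, the bin start, and the bin width are made-up assumptions; each bin includes its lower edge and excludes its upper edge, which is one common convention.

```python
def bin_counts(values, start, width, n_bins):
    """Count values in n_bins equal-width bins [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for value in values:
        k = int((value - start) // width)   # index of the bin containing value
        if 0 <= k < n_bins:                 # ignore values outside the binned range
            counts[k] += 1
    return counts

# Hypothetical test scores (illustration only).
scores = [52, 55, 61, 64, 67, 68, 70, 73, 75, 78, 81, 88, 93]
counts = bin_counts(scores, start=50, width=10, n_bins=5)

# Text rendering: one row per bin, one '#' per observation.
for k, count in enumerate(counts):
    low = 50 + 10 * k
    print(f"[{low}, {low + 10}): {'#' * count}")
```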
Key Differences:
- Histograms vs. Bar Graphs
- Histogram:
- Variable graphed is quantitative.
- The bars touch due to continuous x-axis with ordered values.
- Bar Graph:
- Variable graphed is categorical.
- Bars do not touch and x-values can be in any order.
Histograms vs. Stemplots
- Both display quantitative data, but they suit different dataset sizes: stemplots work best for small datasets, while histograms are used for larger ones.
Cumulative Frequency Ogives
- This is a type of line graph.
- Two types of cumulative frequency ogives exist:
- Less than cumulative frequency ogives: Determine how many responses or what percentage are less than a specific value.
- More than cumulative frequency ogives: Determine how many responses or what percentage are greater than a specific value.
- They are constructed from frequency distributions with:
- Horizontal Axis: Continuous range of values for the variable.
- Vertical Axis: Frequency corresponding to less than or more than a specific value.
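A rough sketch of the values plotted on each ogive, assuming a hypothetical grouped distribution with five classes. The "more than" counts here include observations exactly at a boundary, which is an assumption of this sketch.

```python
from itertools import accumulate

# Hypothetical grouped data (illustration only): five classes 50-60, ..., 90-100.
boundaries = [50, 60, 70, 80, 90, 100]   # class boundaries (x-axis)
frequencies = [2, 4, 4, 2, 1]            # frequency of each class

total = sum(frequencies)
# Less-than ogive: number of observations below each boundary.
less_than = [0] + list(accumulate(frequencies))
# More-than ogive: number of observations at or above each boundary.
more_than = [total - c for c in less_than]

for boundary, lt, mt in zip(boundaries, less_than, more_than):
    print(f"{boundary:>3}: less than = {lt:>2}, more than = {mt:>2}")
```

Dividing each count by `total` gives the percentage version of either ogive.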
2.2. Frequency Distributions for Quantitative Data
- Frequency Distribution: The frequency distribution of a variable provides the possible values it can take and how often these values occur, either by frequency or percentage.
- Applicable for both ungrouped data and grouped data.
Ungrouped and Grouped Data Terminology
- Class: The first column indicating the interval of values to include in that row.
- Class Limits:
- Lower Class Limit: The lowest possible value in that class.
- Upper Class Limit: The highest possible value in that class.
- Class Mark (aka Midpoint): The average of each class, calculated by taking the average of the class limits.
- Class Boundaries: Values that separate each class; used to avoid gaps between adjacent classes.
Class Width (Class Size)
- The class width is the difference between the upper and lower class boundaries of a class (equivalently, the difference between the lower class limits of two consecutive classes).
Relative Frequency
- This indicates the frequency of each class as a proportion of the total frequencies; the sum of relative frequencies should equal 1.
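The terms above can be tied together with a small worked example. The classes and frequencies are made up, and placing each boundary halfway across the gap between adjacent class limits is an assumption (a common convention for data recorded to the nearest unit).

```python
# Hypothetical grouped data: (lower class limit, upper class limit, frequency).
classes = [(10, 19, 3), (20, 29, 5), (30, 39, 2)]

gap = classes[1][0] - classes[0][1]             # gap between adjacent class limits (1 here)
total = sum(freq for _, _, freq in classes)     # total frequency

table = []
for lower, upper, freq in classes:
    mark = (lower + upper) / 2                  # class mark (midpoint)
    lower_boundary = lower - gap / 2            # boundaries close the gap between classes
    upper_boundary = upper + gap / 2
    width = upper_boundary - lower_boundary     # class width
    relative = freq / total                     # relative frequency
    table.append((mark, lower_boundary, upper_boundary, width, relative))

for row in table:
    print(row)
```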
Cumulative Frequency Distributions
- Examine how many observations (or what proportion of observations) lie above or below a particular class boundary, plotted against the upper class boundaries.
2.3. Numerical Methods for Describing Data
- Overall Examination Questions:
- What is the overall shape of the distribution? (symmetric or skewed / number of peaks)
- Are there any unusual points? (errors, outliers, or influential points)
- Where is the center of the distribution? (mean, median, mode)
- How much variation is there? (range, standard deviation, variance, IQR)
2.3.1. Shape
- Number of Peaks:
- Unimodal: 1 peak
- Bimodal: 2 peaks
- Multimodal: More than 2 peaks
- Skewness: Describes how far the distribution departs from symmetry.
- Symmetric: The two tails are approximately equal; neither is longer.
- Right Skew (aka: positively skewed): The longer tail extends to the right.
- Left Skew (aka: negatively skewed): The longer tail extends to the left.
2.3.2. Centre (Middle)
Mean (Arithmetic Mean):
- Formula:
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$, where $x_i$ is the $i$-th data value and $n$ is the total frequency (the number of observations).
Mode: The mode is the value that appears most frequently in the dataset.
- For a grouped frequency distribution, the modal class is identified first: it is the class with the highest frequency.
- The mode is then estimated by applying a formula based on how much the modal class frequency exceeds the frequencies of the two adjacent classes.
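For grouped data, a sketch of both calculations follows. The notes do not spell out the formulas, so this assumes the usual textbook versions: the grouped mean weights each class mark by its frequency, and the mode is interpolated as $L + \frac{f_1 - f_0}{(f_1 - f_0) + (f_1 - f_2)} \cdot h$, where $L$ is the lower boundary of the modal class, $h$ its width, $f_1$ its frequency, and $f_0$, $f_2$ the frequencies of the classes before and after it. The data and function names are illustrative.

```python
# Hypothetical grouped data: (lower class boundary, upper class boundary, frequency).
classes = [(9.5, 19.5, 3), (19.5, 29.5, 6), (29.5, 39.5, 1)]

def grouped_mean(classes):
    """Mean of grouped data: class marks weighted by class frequencies."""
    n = sum(freq for _, _, freq in classes)
    return sum(freq * (low + high) / 2 for low, high, freq in classes) / n

def grouped_mode(classes):
    """Interpolated mode; assumes a unique modal class (assumed textbook formula)."""
    freqs = [freq for _, _, freq in classes]
    m = freqs.index(max(freqs))                       # position of the modal class
    low, high, f1 = classes[m]
    f0 = freqs[m - 1] if m > 0 else 0                 # frequency of the class before
    f2 = freqs[m + 1] if m + 1 < len(freqs) else 0    # frequency of the class after
    return low + (f1 - f0) / ((f1 - f0) + (f1 - f2)) * (high - low)

print(grouped_mean(classes))   # (3*14.5 + 6*24.5 + 1*34.5) / 10
print(grouped_mode(classes))   # 19.5 + 3 / (3 + 5) * 10
```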
Median: Defined as the middle score of a distribution.
- Sort the data in ascending order.
- For an odd number of observations, the median position is $\frac{n+1}{2}$; the median is the value at that position.
- For an even number of observations, the median is the average of the two middle values.
Comparing Mean and Median
- The mean is not always equal to the median. Both measure the center, but they can differ when the distribution is skewed, which leads to different interpretations.
- Example:
- Data Set: 15, 20, -3, 2, 23, 35, 10.
- Sorted: -3, 2, 10, 15, 20, 23, 35.
- The median is 15 (the 4th of the 7 sorted values). The mean is 102/7 ≈ 14.57; unlike the median, the mean can be distorted by extreme values.
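The example can be checked with Python's standard `statistics` module. The second dataset, where 35 is replaced by 350, is a hypothetical variation added here to show how an extreme value moves the mean but not the median.

```python
from statistics import mean, median

data = [15, 20, -3, 2, 23, 35, 10]
print(sorted(data))
print(median(data))            # middle (4th) of the 7 sorted values
print(mean(data))              # 102 / 7

# Hypothetical variation: replace the largest value with an extreme one.
with_extreme = [15, 20, -3, 2, 23, 350, 10]
print(median(with_extreme))    # the middle value is unchanged
print(mean(with_extreme))      # 417 / 7, pulled upward by the extreme value
```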
2.3.3. Spread
Range: The difference between the maximum and minimum values in the dataset.
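Variance: Quantifies the degree of variation in the dataset.
- Sample variance and population variance are defined as follows:
- Population Variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
- Sample Variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
where $N$ is the population size, $n$ is the sample size, $\mu$ is the population mean, and $\bar{x}$ is the sample mean.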
Standard Deviation: The square root of the variance; it measures the dispersion of the data points and can be read, roughly, as the typical distance of an observation from the mean.
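As an illustration, the measures of spread can be computed by hand and cross-checked against the standard `statistics` module. The dataset is reused from the mean/median example; treating it as either a population or a sample is only for demonstration.

```python
from math import sqrt
from statistics import pvariance, variance

data = [15, 20, -3, 2, 23, 35, 10]

data_range = max(data) - min(data)                         # 35 - (-3) = 38

n = len(data)
x_bar = sum(data) / n                                      # mean
sum_sq_dev = sum((x - x_bar) ** 2 for x in data)           # sum of squared deviations

population_variance = sum_sq_dev / n                       # divide by N
sample_variance = sum_sq_dev / (n - 1)                     # divide by n - 1
sample_sd = sqrt(sample_variance)                          # standard deviation

# Cross-check against the library implementations.
print(population_variance, pvariance(data))
print(sample_variance, variance(data))
print(data_range, sample_sd)
```

The sample variance is always slightly larger than the population variance computed from the same numbers, because it divides by $n - 1$ rather than $N$.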
2.4. Quartiles and Percentiles
- The pth percentile is the value of the variable such that p% of the observations fall at or below it.
- 25th Percentile = First Quartile = Q1
- 50th Percentile = Median = M
- 75th Percentile = Third Quartile = Q3.
- 5-Number Summary: Includes the minimum, Q1, median, Q3, and maximum values; useful in describing distributions, particularly for non-symmetric datasets.
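A sketch of the 5-number summary follows. Quartile conventions differ between textbooks and software; this assumes the "median of each half" convention, where the middle value is left out of both halves when the number of observations is odd. The function name is illustrative, and the input needs at least two values.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]             # values below the median position
    upper = xs[half + n % 2:]     # values above it (skips the middle value when n is odd)
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

data = [15, 20, -3, 2, 23, 35, 10]
print(five_number_summary(data))
```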
Boxplot
- A boxplot visualizes the 5-number summary, highlighting the median and the interquartile range (IQR = Q3 - Q1), which spans the middle 50% of the data points.
Outliers
- Outliers are data points significantly larger or smaller than the other points in the dataset. The modified boxplot plots them as individual points, with the whiskers extending only to the most extreme values that are not outliers.
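The notes do not state a cutoff for "significantly larger or smaller". A widely used convention, assumed here, flags any point more than 1.5 × IQR below Q1 or above Q3. The sketch uses `statistics.quantiles` with its default method for the quartiles, which may differ slightly from a hand calculation; the extra value 90 is hypothetical.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)     # quartiles (library default method)
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return [x for x in values if x < low_fence or x > high_fence]

data = [-3, 2, 10, 15, 20, 23, 35]
print(iqr_outliers(data))           # nothing lies outside the fences
print(iqr_outliers(data + [90]))    # the hypothetical extra value is flagged
```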
Distribution Description
- For skewed distributions with outliers, use the median and interquartile range for a robust summary.
- For symmetric distributions without outliers, mean and standard deviation are effective.
Comparison of Groups
- When comparing two or more groups, consider using:
- Side-by-side boxplot
- Back-to-back stemplot.