Exploring Data with Tables and Graphs - Vocabulary Flashcards
Frequency Distributions for Organizing and Summarizing Data
- Key concept: When working with large data sets, a frequency distribution (or frequency table) helps organize and summarize data and understand the nature of the distribution of the data set.
- Frequency distribution (frequency table): Shows how data are partitioned among several categories (classes) by listing the categories along with the number (frequency) of data values in each class.
Definitions (1 of 2)
- Lower class limits (LCL): The smallest numbers that can belong to each of the different classes.
- Upper class limits (UCL): The largest numbers that can belong to each of the different classes.
- Class boundaries: The numbers used to separate the classes, but without the gaps created by class limits.
Definitions (2 of 2)
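- Class midpoints: The values in the middle of the classes. Each class midpoint can be found by adding the lower class limit to the upper class limit and dividing the sum by 2: $(\text{LCL} + \text{UCL})/2$.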
- Class width (or width): The difference between two consecutive lower class limits in a frequency distribution.
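The definitions above can be checked numerically. The following is a minimal sketch that uses the class limits from the drive-through service-time example in these notes; the variable names are illustrative, and the half-gap rule for boundaries assumes the gap between adjacent classes is the same everywhere.

```python
# Class limits taken from the service-time example in these notes.
lower_limits = [75, 125, 175, 225, 275]
upper_limits = [124, 174, 224, 274, 324]

# Class width: difference between two consecutive lower class limits.
class_width = lower_limits[1] - lower_limits[0]

# Class midpoints: (lower limit + upper limit) / 2 for each class.
midpoints = [(lo + hi) / 2 for lo, hi in zip(lower_limits, upper_limits)]

# Class boundaries: halfway across the gap between adjacent classes,
# extended by the same half-gap below the first class and above the last.
# (Assumes every gap between adjacent classes has the same size.)
half_gap = (lower_limits[1] - upper_limits[0]) / 2
boundaries = [lower_limits[0] - half_gap] + [hi + half_gap for hi in upper_limits]

print("width:", class_width)
print("midpoints:", midpoints)
print("boundaries:", boundaries)
```

Note that the width is not the upper limit minus the lower limit of one class (124 - 75 = 49), which is a common slip.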
Procedure for Constructing a Frequency Distribution (1 of 2)
- Select the number of classes, usually between 5 and 20.
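- Calculate the class width as approximately $(\text{maximum data value} - \text{minimum data value}) / (\text{number of classes})$. Round this result to get a convenient number (it’s usually best to round up).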
Procedure for Constructing a Frequency Distribution (2 of 2)
- Choose the value for the first lower class limit by using either the minimum value or a convenient value below the minimum.
- Using the first lower class limit and class width, list the other lower class limits.
- List the lower class limits in a vertical column and then determine and enter the upper class limits.
- Take each data value and place a tally mark in the appropriate class. Add the tally marks to obtain the frequency.
Example: McDonald’s Lunch Service Times (5 classes)
- Drive-through service times (seconds) example data: 107, 139, 197, 209, 281, 254, 163, 150, 127, 308, 206, 187, 169, 83, 127, 133, 140, 143, 130, 144, 91, 113, 153, 255, 252, 200, 117, 167, 148, 184, 123, 153, 155, 154, 100, 117, 101, 138, 186, 196, 146, 90, 144, 119, 135, 151, 197, 171, 190, 169
- Construct the five-class frequency distribution (as shown in the example):
- Time (Seconds) classes: 75-124, 125-174, 175-224, 225-274, 275-324
- Frequencies: 11, 24, 10, 3, 2
- Steps reflected in the example:
- Step 1: number of classes = 5
- Step 2: class width $w = 50$ (from the example: $(308 - 83)/5 = 45$, rounded up to 50)
- Step 3: first lower class limit chosen as 75 (a convenient value below the minimum 83)
- Step 4: subsequent lower class limits: 75, 125, 175, 225, 275
- Step 5: corresponding upper class limits: 124, 174, 224, 274, 324
- Step 6: tally data into classes to obtain frequencies: 11, 24, 10, 3, 2
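The six steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the textbook's own procedure in code: the convenient width of 50 and the starting limit of 75 are entered by hand as in the example, and the upper limits are computed as "next lower limit minus 1", which assumes whole-number data.

```python
times = [
    107, 139, 197, 209, 281, 254, 163, 150, 127, 308,
    206, 187, 169, 83, 127, 133, 140, 143, 130, 144,
    91, 113, 153, 255, 252, 200, 117, 167, 148, 184,
    123, 153, 155, 154, 100, 117, 101, 138, 186, 196,
    146, 90, 144, 119, 135, 151, 197, 171, 190, 169,
]

# Step 1: choose the number of classes.
num_classes = 5

# Step 2: raw class width, then a convenient rounded-up value chosen by hand.
raw_width = (max(times) - min(times)) / num_classes
class_width = 50

# Step 3: first lower class limit, a convenient value below the minimum.
first_lower = 75

# Step 4: remaining lower class limits.
lower_limits = [first_lower + i * class_width for i in range(num_classes)]

# Step 5: upper class limits (assumes integer data, so classes do not overlap).
upper_limits = [lo + class_width - 1 for lo in lower_limits]

# Step 6: tally each value into its class.
frequencies = [
    sum(lo <= t <= hi for t in times)
    for lo, hi in zip(lower_limits, upper_limits)
]

for lo, hi, f in zip(lower_limits, upper_limits, frequencies):
    print(f"{lo}-{hi}: {f}")
```

A quick sanity check is that the frequencies add up to the number of data values.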
Relative Frequency Distribution (1 of 2)
- Relative frequency distribution (or percentage frequency distribution): Replace each class frequency with a relative frequency (proportion) or a percentage.
Relative Frequency Distribution (2 of 2)
- The sum of the relative frequencies should be 1 (or 100% for percentages); small discrepancies can occur because of rounding.
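As a small illustration, the relative frequencies for the service-time example can be computed from the class frequencies listed earlier. The variable names are illustrative only.

```python
# Class frequencies from the five-class service-time example.
frequencies = [11, 24, 10, 3, 2]
total = sum(frequencies)

# Relative frequency = class frequency / sum of all frequencies.
relative = [f / total for f in frequencies]

# Percentage form of the same distribution.
percentages = [100 * f / total for f in frequencies]

print(relative)
print(percentages)
print("sum of relative frequencies:", sum(relative))
```

Because of floating-point arithmetic, the printed sum may differ from 1 in the last decimal places, which mirrors the rounding caveat above.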
Cumulative Frequency Distribution
- Cumulative frequency distribution: The frequency for each class is the sum of the frequencies for that class and all previous classes.
- Example (McDonald’s lunch service times):
- Less than 125: 11
- Less than 175: 35
- Less than 225: 45
- Less than 275: 48
- Less than 325: 50
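The cumulative counts above are running totals of the class frequencies. A minimal sketch using the standard library's `itertools.accumulate`:

```python
from itertools import accumulate

# Class frequencies and upper class limits from the service-time example.
frequencies = [11, 24, 10, 3, 2]
upper_limits = [124, 174, 224, 274, 324]

# Running total: each entry is that class's frequency plus all earlier ones.
cumulative = list(accumulate(frequencies))

# Label each row "Less than <next lower limit>" (integer data assumed).
for hi, c in zip(upper_limits, cumulative):
    print(f"Less than {hi + 1}: {c}")
```

The last cumulative frequency always equals the total number of data values.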
Critical Thinking: Using Frequency Distributions to Understand Data
- To assess if data are approximately normal:
1) Frequencies rise to a peak and then fall.
2) The distribution is approximately symmetric, with preceding frequencies roughly mirroring subsequent frequencies around the maximum.
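The two criteria can be turned into a rough check on a list of class frequencies. The helpers `rises_then_falls` and `mirror_pairs` below are hypothetical names invented for this sketch; they only organize the numbers, and the judgment about "approximately symmetric" is still left to the reader.

```python
def rises_then_falls(freqs):
    """True if frequencies never decrease before the peak and never increase after it."""
    peak = freqs.index(max(freqs))
    rising = all(a <= b for a, b in zip(freqs[:peak], freqs[1:peak + 1]))
    falling = all(a >= b for a, b in zip(freqs[peak:], freqs[peak + 1:]))
    return rising and falling


def mirror_pairs(freqs):
    """Pair up frequencies at equal distances left and right of the peak.

    A missing class on either side is treated as frequency 0.
    """
    peak = freqs.index(max(freqs))
    reach = max(peak, len(freqs) - 1 - peak)
    pairs = []
    for k in range(1, reach + 1):
        left = freqs[peak - k] if peak - k >= 0 else 0
        right = freqs[peak + k] if peak + k < len(freqs) else 0
        pairs.append((left, right))
    return pairs


# Class frequencies from the service-time example.
service_freqs = [11, 24, 10, 3, 2]
print(rises_then_falls(service_freqs))
print(mirror_pairs(service_freqs))
```

For the service-time frequencies the first criterion holds, but the mirror pairs show classes on the right of the peak with no counterpart on the left, so the symmetry criterion is doubtful.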
Gaps in Data
- Gaps can indicate data from two or more populations. However, gaps are not definitive proof; two populations can exist without gaps.
Example: What Does a Gap Tell Us? (Pennies weights)
- Weights (grams) of pennies have a distribution with a gap between light pennies (2.40-2.99 g) and heavier pennies (3.00-3.19 g).
- Interpretation: likely two populations: pre-1983 pennies (95% copper, 5% zinc) vs post-1983 pennies (2.5% copper, 97.5% zinc).
Comparisons: Relative Frequencies Across Groups
- Combining two or more relative frequency distributions in one table facilitates comparisons.
Example: McDonald’s vs Dunkin’ Donuts - Drive-Through Times (Relative frequencies)
- Time (seconds) categories: 25-74, 75-124, 125-174, 175-224, 225-274, 275-324.
- McDonald’s relative frequencies: 0, 0.22, 0.48, 0.20, 0.06, 0.04 (a 0 means no service times fell in that class).
- Dunkin’ Donuts relative frequencies: 0.22, 0.44, 0.28, 0.06, 0, 0 (a 0 means no service times fell in that class).
- Insight: Dunkin’ Donuts service times appear to be generally shorter than McDonald’s.
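The comparison can be laid out as a single table. The sketch below prints the two distributions side by side and adds one illustrative summary (the share of service times under 125 seconds) that is computed here and not quoted from the source.

```python
classes = ["25-74", "75-124", "125-174", "175-224", "225-274", "275-324"]
mcdonalds = [0.0, 0.22, 0.48, 0.20, 0.06, 0.04]
dunkin = [0.22, 0.44, 0.28, 0.06, 0.0, 0.0]

header = ("Time (s)", "McDonald's", "Dunkin' Donuts")
print(f"{header[0]:<10}{header[1]:>12}{header[2]:>16}")
for label, m, d in zip(classes, mcdonalds, dunkin):
    print(f"{label:<10}{m:>12.2f}{d:>16.2f}")

# Share of service times in the first two classes (under 125 seconds).
under_125_mcd = sum(mcdonalds[:2])
under_125_dd = sum(dunkin[:2])
print(f"Under 125 s: McDonald's {under_125_mcd:.2f}, Dunkin' Donuts {under_125_dd:.2f}")
```

Relative frequencies are used here because the two samples need not be the same size, so raw counts would not be directly comparable.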
Histograms and the Shape of Data
- Key concept: A histogram is a graph that makes the distribution easier to interpret than a table of numbers.
- Histogram definition: A graph consisting of bars of equal width drawn adjacent to each other (unless there are gaps in the data). The horizontal axis represents classes of quantitative data values, and the vertical axis represents frequencies. The heights of the bars correspond to frequencies.
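To show the idea without a plotting library, here is a text-only sketch that draws the service-time frequency distribution as horizontal bars, one `#` per data value. A real histogram would normally use vertical bars and a graphing tool; this layout is a simplification chosen for the example.

```python
classes = ["75-124", "125-174", "175-224", "225-274", "275-324"]
frequencies = [11, 24, 10, 3, 2]

# One line per class; the bar length equals the class frequency.
lines = [
    f"{label:>8} | {'#' * f} {f}"
    for label, f in zip(classes, frequencies)
]
print("\n".join(lines))
```

Even in this crude form, the peak in the second class and the longer tail toward larger times are visible.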
Important Uses of a Histogram
- Visually displays the shape of the distribution
- Shows the location of the center of the data
- Shows the spread (variability) of the data
- Identifies outliers
Relative Frequency Histogram
- A histogram using relative frequencies (instead of counts) on the vertical axis while keeping the same class boundaries.
Critical Thinking: Interpreting Histograms
- Examine the histogram to learn about:
- Center (roughly where the data cluster)
- Variation (spread of the data)
- Shape of the distribution (e.g., symmetric, skewed)
- Outliers
- Time (for time-series data, if applicable)
Common Distribution Shapes
- Bell-shaped (Normal) distribution
- Uniform distribution
- Skewed to the right (positive skew)
- Skewed to the left (negative skew)
- Visual examples often shown as histograms or density plots
Normal Distribution
- A histogram that is roughly bell-shaped suggests that the data come from an approximately normal distribution.
Skewness (1 of 3)
- Skewness: A distribution of data is skewed if it is not symmetric and extends more to one side than to the other.
Skewness (2 of 3) – Right-skewed (Positive)
- Data skewed to the right have a longer right tail.
Skewness (3 of 3) – Left-skewed (Negative)
- Data skewed to the left have a longer left tail.
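Skew direction can also be checked numerically. The sketch below applies two common indicators to the drive-through service times listed earlier in these notes: comparing the mean with the median, and a moment-based skewness coefficient. Neither formula appears in the source; both are standard measures brought in here as assumptions for illustration.

```python
from statistics import mean, median, pstdev

times = [
    107, 139, 197, 209, 281, 254, 163, 150, 127, 308,
    206, 187, 169, 83, 127, 133, 140, 143, 130, 144,
    91, 113, 153, 255, 252, 200, 117, 167, 148, 184,
    123, 153, 155, 154, 100, 117, 101, 138, 186, 196,
    146, 90, 144, 119, 135, 151, 197, 171, 190, 169,
]

m = mean(times)
med = median(times)
s = pstdev(times)  # population standard deviation

# Moment coefficient of skewness: average cubed deviation divided by s^3.
# Positive suggests a longer right tail, negative a longer left tail.
skew = sum((t - m) ** 3 for t in times) / (len(times) * s ** 3)

print("mean:", m, "median:", med)
print("skewness coefficient:", round(skew, 3))
```

A mean noticeably above the median, together with a positive coefficient, points to right skew for these data.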
Assessing Normality with Normal Quantile Plots (QQ plots) (1 of 5)
- Criteria for normality via QQ plot: A normal distribution yields a pattern that is reasonably close to a straight line with no conspicuous non-linear patterns.
Assessing Normality with Normal Quantile Plots (2 of 5)
- Not normal if: the points do not lie close to a straight line, or there is a systematic pattern not aligned with a straight line.
Assessing Normality with Normal Quantile Plots (3 of 5)
- Normal distribution: points roughly lie on a straight line with no systematic deviations.
Assessing Normality with Normal Quantile Plots (4 of 5)
- Not a normal distribution if points do not lie close to the line.
Assessing Normality with Normal Quantile Plots (5 of 5)
- Not normal if points show a systematic pattern not described by a straight line.
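The points of a normal quantile plot can be computed with the standard library. This sketch assumes one common convention for the plotting positions, cumulative areas of 1/(2n), 3/(2n), 5/(2n), and so on; other software may use slightly different positions. The function name `normal_quantile_points` is hypothetical, and the data are the drive-through service times listed earlier in these notes.

```python
from statistics import NormalDist

times = [
    107, 139, 197, 209, 281, 254, 163, 150, 127, 308,
    206, 187, 169, 83, 127, 133, 140, 143, 130, 144,
    91, 113, 153, 255, 252, 200, 117, 167, 148, 184,
    123, 153, 155, 154, 100, 117, 101, 138, 186, 196,
    146, 90, 144, 119, 135, 151, 197, 171, 190, 169,
]


def normal_quantile_points(data):
    """Return (x, z) pairs: sorted data values and matching standard normal quantiles."""
    xs = sorted(data)
    n = len(xs)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    # Assumed plotting positions: (2i + 1) / (2n) for i = 0, 1, ..., n - 1.
    zs = [standard_normal.inv_cdf((2 * i + 1) / (2 * n)) for i in range(n)]
    return list(zip(xs, zs))


points = normal_quantile_points(times)
for x, z in points[:3] + points[-3:]:
    print(f"x = {x:>3}, z = {z:+.3f}")
```

Plotting these pairs (x on the horizontal axis, z on the vertical) and judging how closely they follow a straight line is the visual check described above.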
Graphs that Enlighten and Graphs that Deceive
- Graphs that enlighten help us understand data; graphs that deceive can mislead by design or poor practice.
- Technology enables powerful graphing, but must be used responsibly.
Graphs that Enlighten: Dotplots
- Dotplot: A graph of quantitative data in which each value is plotted as a dot above a horizontal scale. Dots representing equal values are stacked.
- Features: Displays the shape of the distribution; often possible to recreate the original data.
Stemplots (Stem-and-Leaf Plots)
- Stemplot: Represents data by separating each value into a stem (e.g., leftmost digits) and a leaf (e.g., rightmost digit).
- Features: Shows the shape of the distribution; retains original data values; data are sorted.
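A stemplot is easy to build by hand or in code. The sketch below uses the drive-through service times listed earlier in these notes, taking everything except the last digit as the stem and the last digit as the leaf; that split is one reasonable choice for whole-number data, not the only one.

```python
from collections import defaultdict

times = [
    107, 139, 197, 209, 281, 254, 163, 150, 127, 308,
    206, 187, 169, 83, 127, 133, 140, 143, 130, 144,
    91, 113, 153, 255, 252, 200, 117, 167, 148, 184,
    123, 153, 155, 154, 100, 117, 101, 138, 186, 196,
    146, 90, 144, 119, 135, 151, 197, 171, 190, 169,
]

# Sorting first means the leaves within each stem come out in order.
stems = defaultdict(list)
for t in sorted(times):
    stems[t // 10].append(t % 10)  # stem = leading digits, leaf = last digit

# Print every stem from smallest to largest, including empty ones,
# so that gaps in the data remain visible.
for stem in range(min(stems), max(stems) + 1):
    leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
    print(f"{stem:>3} | {leaves}")
```

Reading a row such as `12 | 377` back as 123, 127, 127 shows how the original values are retained.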
Time-Series Graph (Trend Over Time)
- Time-series graph: A graph of time-series data, which are quantitative data collected at different points in time (such as monthly or yearly).
- Feature: Reveals trends over time.
Bar Graphs and Pareto Charts
- Bar graph: Bars of equal width for categorical data; may have gaps between bars. Shows relative distribution across categories.
- Pareto chart: A bar graph for categorical data with bars arranged in descending order of frequency, highlighting the most important categories.
Pie Chart
- Pie charts depict categorical data as slices of a circle; slice size is proportional to the category frequency.
Frequency Polygon
- Frequency polygon: A graph using line segments connected to points located at class midpoints; similar to a histogram but uses lines instead of bars.
- Relative frequency polygon uses relative frequencies on the vertical axis.
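The vertices of a frequency polygon can be listed directly from the class limits and frequencies. This sketch uses the service-time example; closing the polygon with zero-height points one class width beyond each end is a common drawing convention assumed here, not something stated in these notes.

```python
lower_limits = [75, 125, 175, 225, 275]
upper_limits = [124, 174, 224, 274, 324]
frequencies = [11, 24, 10, 3, 2]

class_width = lower_limits[1] - lower_limits[0]
midpoints = [(lo + hi) / 2 for lo, hi in zip(lower_limits, upper_limits)]

# One vertex per class: (class midpoint, frequency).
vertices = list(zip(midpoints, frequencies))

# Assumed convention: extend to the horizontal axis on both sides.
closed = (
    [(midpoints[0] - class_width, 0)]
    + vertices
    + [(midpoints[-1] + class_width, 0)]
)

# For a relative frequency polygon, divide each height by the total.
total = sum(frequencies)
relative_vertices = [(x, f / total) for x, f in vertices]

print(closed)
print(relative_vertices)
```

Connecting the `closed` points in order with straight line segments gives the polygon.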
Graphs That Deceive (1 of 4)
- Nonzero vertical axis: Exaggerating differences by starting the vertical axis at a value greater than zero.
Graphs That Deceive (2 of 4)
- Always inspect whether the vertical axis starts at zero or elsewhere, as this affects perceived differences.
Graphs That Deceive (3 of 4) – Pictographs
- Pictographs use drawings of objects and can be misleading if not scaled properly (one-dimensional vs two- or three-dimensional representations).
Graphs That Deceive (4 of 4) – Principles of basic geometry
- Doubling each side of a square increases area by a factor of 4; doubling each side of a cube increases volume by a factor of 8. Pictographs can distort these relationships.
Pictographs Example
- Example: 1970: 37% of U.S. adults smoked vs 2013: 18% of U.S. adults smoked.
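The geometry behind the distortion can be worked out for the smoking example. The sketch below assumes a pictograph in which the 1970 drawing is scaled so that its height is 37/18 times the height of the 2013 drawing, with width (and, for a 3-D drawing, depth) scaled by the same factor.

```python
pct_1970 = 37
pct_2013 = 18

# True ratio of the two percentages (what a fair graph should convey).
true_ratio = pct_1970 / pct_2013

# If height AND width are both scaled by that ratio, area grows by its square.
area_ratio = true_ratio ** 2

# If a 3-D object is scaled in all three dimensions, volume grows by its cube.
volume_ratio = true_ratio ** 3

print(f"true ratio:   {true_ratio:.2f}")
print(f"area ratio:   {area_ratio:.2f}")
print(f"volume ratio: {volume_ratio:.2f}")
```

The 1970 rate is about twice the 2013 rate, but a scaled two-dimensional picture looks roughly four times as large and a scaled three-dimensional one nearly nine times as large.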
Concluding Thoughts on Graphic Displays
- Beyond these graphs, there are many others. The goal of graph design is to enlighten, not mislead.
- Edward Tufte’s principles (from The Visual Display of Quantitative Information):
- For small data sets (n ≤ 20), a table may be preferable to a graph.
- Graphs should reveal the true nature of the data, not be visually distracting.
- Do not distort data; use graphs to reveal truth.
- Most ink on a graph should be data, not design elements.
Scatterplots, Correlation, and Regression (Introduction)
- Key concept: This section introduces the analysis of paired data, covering scatterplots, correlation, and a brief look at regression.
Scatterplot and Correlation (1 of 2)
- Correlation: A correlation exists between two variables when the values of one variable are associated in some way with the values of the other variable.
- Linear correlation: When the association can be approximated by a straight line.
Scatterplot (2 of 2)
- Scatterplot: A plot of paired data (x, y) with a horizontal x-axis and vertical y-axis. x represents the first variable; y the second.
Example: Waist and Arm Correlation (1 of 2)
- Data suggest a correlation between waist circumference and arm circumference.
Example: No Correlation (2 of 2)
- Data show no clear pattern between weights and pulse rates.
Linear Correlation Coefficient r
- r denotes the linear correlation coefficient and measures the strength of the linear association between two variables.
Using r for Determining Correlation
- r is always between $-1$ and $1$:
- If $|r|$ is close to 1, a strong linear relationship exists; if $|r|$ is close to 0, there is little to no linear correlation.
Example: Correlation between Shoe Print Lengths and Heights? (1 of 2)
- Data: Shoe Print Length (cm): 29.7, 29.7, 31.4, 31.8, 27.6; Height (cm): 175.3, 177.8, 185.4, 175.3, 172.7
Example: Correlation between Shoe Print Lengths and Heights? (2 of 2)
- The plot does not clearly show a strong linear pattern; interpretation is inconclusive from the scatter alone.
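A scatterplot alone can be hard to judge with so few points, so it helps to compute $r$ for the five pairs. The sketch below uses the usual product-moment formula for the linear correlation coefficient; the function name `linear_correlation` is illustrative.

```python
import math

shoe_print_cm = [29.7, 29.7, 31.4, 31.8, 27.6]
height_cm = [175.3, 177.8, 185.4, 175.3, 172.7]


def linear_correlation(xs, ys):
    """Linear correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = linear_correlation(shoe_print_cm, height_cm)
print(round(r, 3))
```

With only five pairs, a moderate value of $r$ on its own is weak evidence; the sample size has to be taken into account as well.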
P-Value
- P-value: Assuming there is really no linear correlation between the two variables, the P-value is the probability of obtaining paired sample data with a linear correlation coefficient $r$ at least as extreme as the one observed.
Interpreting a P-Value (Previous Example, n = 5)
- If the P-value is 0.294, there is a high chance (29.4%) of obtaining $r = 0.591$ (or more extreme) by chance when there is no true correlation.
- Conclusion: Not sufficient evidence to conclude a linear correlation exists.
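These notes do not say how the P-value was obtained. One standard approach, assumed here, is a two-tailed t test with $n - 2$ degrees of freedom and test statistic $t = r\sqrt{(n-2)/(1-r^2)}$. For $n = 5$ there are 3 degrees of freedom, a special case where the t distribution's cumulative distribution function has a closed form, so the sketch needs only the `math` module. The helper `t_cdf_df3` is a hypothetical name and works only for 3 degrees of freedom.

```python
import math


def t_cdf_df3(t):
    """CDF of Student's t distribution with exactly 3 degrees of freedom."""
    u = t / math.sqrt(3)
    return 0.5 + (u / (1 + u * u) + math.atan(u)) / math.pi


r = 0.591  # sample correlation from the shoe print example
n = 5      # number of data pairs

# Test statistic for H0: no linear correlation (assumed method, see lead-in).
t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))

# Two-tailed P-value: probability of a t value at least this extreme in either direction.
p_value = 2 * (1 - t_cdf_df3(abs(t_stat)))

print("t statistic:", round(t_stat, 3))
print("P-value:", round(p_value, 3))
```

Under these assumptions the result should land close to the 0.294 quoted above. For other sample sizes a general t distribution routine from a statistics package would be needed.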
Interpreting a P-Value (n = 40 example)
- If $r = 0.813$ and the P-value is reported as 0.000 (that is, below 0.0005 when rounded to three decimal places), this small P-value provides strong evidence of a linear correlation.
Regression
- Regression: Given paired data, the regression line (line of best fit or least-squares line) is the straight line that best fits the scatterplot.
- General form of the regression equation: $\hat{y} = b_0 + b_1 x$, where $b_0$ is the y-intercept and $b_1$ is the slope.
Example: Regression Line (1 of 2)
- The scatterplot places shoe print length on the horizontal axis and height on the vertical axis, with the regression line drawn through the points.
Example: Regression Line (2 of 2)
- Regression coefficients given: $b_0 = 80.9$ (intercept) and $b_1 = 3.22$ (slope).
- Regression equation: $\hat{y} = 80.9 + 3.22x$
- Note: The regression line is the least-squares line; it minimizes the sum of the squared residuals (the vertical distances between the observed values and the line).
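The sketch below does two things: it evaluates the regression equation with the coefficients quoted in this example, and it shows the least-squares formulas applied to the five shoe print and height pairs listed earlier in these notes. The coefficients fitted to those five pairs need not match the quoted $b_0$ and $b_1$, which may come from a different (larger) sample; that is an assumption, since the notes do not say. The function names and the input value of 29.0 cm are illustrative.

```python
def predict_height(shoe_print_cm, b0=80.9, b1=3.22):
    """Predicted height (cm) from the quoted regression equation y-hat = b0 + b1 * x."""
    return b0 + b1 * shoe_print_cm


def least_squares(xs, ys):
    """Return (b0, b1) for the least-squares line through paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx              # slope
    b0 = mean_y - b1 * mean_x   # intercept: line passes through (mean_x, mean_y)
    return b0, b1


# Using the quoted coefficients for a hypothetical 29.0 cm shoe print.
print(round(predict_height(29.0), 2))

# Fitting the five pairs listed earlier (a separate, much smaller sample).
shoe_print = [29.7, 29.7, 31.4, 31.8, 27.6]
height = [175.3, 177.8, 185.4, 175.3, 172.7]
b0_small, b1_small = least_squares(shoe_print, height)
print(round(b0_small, 1), round(b1_small, 2))
```

Predictions from a regression line are only sensible when there is good evidence of a linear correlation and the input lies within the range of the sample data.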
Summary of Key Quantities
Frequency $f_i$: count of data values in class $i$.
$N$: total number of data values.
Relative frequency $p_i$: $p_i = f_i / N$, or as a percentage, $p_i = (f_i / N) \times 100\%$.
Cumulative frequency $F_i$: sum of frequencies up to and including class $i$, $F_i = f_1 + f_2 + \cdots + f_i$.
Class width $w$: difference between consecutive lower class limits: $w = \text{LCL}_{i+1} - \text{LCL}_i$.
Class boundaries separate classes without gaps; for integer data, boundaries are typically halfway between adjacent class limits.
Midpoint of class $i$: $(\text{LCL}_i + \text{UCL}_i)/2$.
Normal vs skewed distributions: assessed via histogram shape, QQ plots, and skewness direction (positive or negative).
Important warning: Graphs should accurately convey data; beware deceptive scaling, especially nonzero axes, or pictographs with disproportionate symbol sizes.
Practical use: Histograms and relative frequency histograms are often easier to interpret than raw frequency tables for understanding distributions, central tendency, spread, and outliers. They also guide decisions about further statistical analysis (e.g., normality assessment, appropriate models).