Histograms, Distributions, and Scatter Plots (Notes)
Data basics
- When you start with data, a histogram is a way to present a large collection of numbers so you can see patterns.
- A histogram is built from a frequency distribution: data are broken into classes (bins) of equal width, and the count (frequency) of data values in each class is shown as the height of a bar.
Frequency distributions and class width
- Classes have equal width. Example: in a histogram with a class width of five years, you might see class intervals like [60, 65), [65, 70), etc. If data are integers, these become 60–64, 65–69, etc.
- Boundary handling is up to you, but it must be uniform across all classes. A common convention is to make each class include its left endpoint and exclude its right endpoint: the first class contains 60 ≤ x < 65, the second 65 ≤ x < 70, and so on, so a value of exactly 65 is counted in the second class. If data are integers, this corresponds to 60–64 for the first class, 65–69 for the second, etc.
- The class width is the same for every class and equals the difference between consecutive left-hand limits: e.g. with left-hand limits 75, 125, 175, 225, 275 the width is 125−75 = 50, 175−125 = 50, etc. So each class spans 50 units.
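- Example (class width and left limits):
- Left-hand limits: 75, 125, 175, 225, 275
- Class width: 125 − 75 = 50
- Class intervals (left-inclusive, right-exclusive): [75, 125), [125, 175), [175, 225), [225, 275), [275, 325)
- If data are integers, the discrete equivalents are 75–124, 125–174, 175–224, 225–274, 275–324.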
- Frequencies in each class are the counts of data values that fall into the class.
- Example frequencies (illustrative): first class 11, second class 24, third class 10, fourth class 3, fifth class (not specified in the transcript). The maximum frequency mentioned is 24.
- The bar for a histogram is drawn for each class, adjacent to each other with no gaps between bars (unlike a bar graph).
- Gaps between bars occur only if a class has zero data values (an empty class).
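The left-inclusive, right-exclusive counting rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the course's method: the function name `frequency_distribution`, its parameters, and the `ages` list are all made up for the example.

```python
from collections import Counter

def frequency_distribution(values, lower, width, n_classes):
    """Count values per class [lower + k*width, lower + (k+1)*width)."""
    counts = Counter()
    for x in values:
        k = int((x - lower) // width)   # index of the class containing x
        if 0 <= k < n_classes:          # values outside all classes are ignored
            counts[k] += 1
    return [(lower + k * width, lower + (k + 1) * width, counts[k])
            for k in range(n_classes)]

# Hypothetical ages, chosen only to show where boundary values land.
ages = [60, 61, 64, 65, 65, 69, 70, 74]
table = frequency_distribution(ages, lower=60, width=5, n_classes=3)
for left, right, freq in table:
    print(f"[{left}, {right}): {freq}")
```

Both values of exactly 65 are counted in the second class, which is what the 65 ≤ x < 70 convention requires.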
Bar graphs vs histograms
- Bar graphs are used for qualitative (categorical) data, where the bars may have gaps between them; each bar represents a category (e.g., languages like Spanish, Chinese).
- Histograms are used for quantitative (numerical) data and have bars with no gaps, representing contiguous numerical intervals.
Example frequency distribution (time in seconds)
- First class: 75–124 (width 50) with frequency 11
- Second class: 125–174 with frequency 24
- Third class: 175–224 with frequency 10
- Fourth class: 225–274 with frequency 3
- Fifth class: 275–324 (width 50) with some frequency (not explicitly given in the transcript)
- Note on width calculation: class left limits are 75, 125, 175, 225, 275; width = 50.
- Discussion question: what is the width of each class? Answer: 50 (for example, 125 − 75 = 50).
- If you were to draw the histogram, you would place bars with heights corresponding to these frequencies, with no gaps between bars if data exist in consecutive classes.
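As a rough sketch of the width check and of what the bars would look like, the snippet below computes the width from the left-hand limits and prints a text histogram. It uses only the four frequencies stated in the notes; the fifth class is left out because its frequency is not given. The helper name `text_histogram` is an assumption made for this example.

```python
left_limits = [75, 125, 175, 225, 275]

# Width = difference between consecutive left-hand limits.
widths = [b - a for a, b in zip(left_limits, left_limits[1:])]
assert len(set(widths)) == 1, "classes must have equal width"
width = widths[0]

# Only the four frequencies given in the notes; the fifth is not stated.
frequencies = [11, 24, 10, 3]

def text_histogram(left_limits, width, frequencies):
    """Return one text row per class, with one '#' per data value."""
    lines = []
    for left, freq in zip(left_limits, frequencies):
        label = f"{left}-{left + width - 1}"   # integer-data class label
        lines.append(f"{label:>8} | {'#' * freq} {freq}")
    return lines

rows = text_histogram(left_limits, width, frequencies)
print("\n".join(rows))
```

The rows are printed one directly under the other, mirroring the no-gaps rule for adjacent classes.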
Relative frequency histogram
- Relative frequency = proportion of the total data in a class, i.e., \( \text{relative frequency} = \frac{f_i}{n} \), where \( f_i \) is the class frequency and \( n \) is the total number of data values.
- Example data (prices): class widths are 10 units (e.g., $1–$10, $11–$20, …).
- Frequencies for the example: first class 20, second 21, third 13, fourth 8, fifth 4. Total data points: n = 20 + 21 + 13 + 8 + 4 = 66.
- Relative frequencies:
- First class: 20/66 ≈ 0.303
- Second class: 21/66 ≈ 0.318
- Third class: 13/66 ≈ 0.197
- Fourth class: 8/66 ≈ 0.121
- Fifth class: 4/66 ≈ 0.061
- In a relative frequency histogram, the bars have the same shape as the frequency histogram, but the y-axis shows proportions (0 to 1). The y-axis should not exceed 1 since it represents a fraction of the total data.
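The relative frequencies for the price example can be checked with a short Python sketch. The frequencies come from the notes; the variable names are only illustrative.

```python
frequencies = [20, 21, 13, 8, 4]   # class frequencies from the price example
n = sum(frequencies)               # total number of data values

# Relative frequency of each class = class frequency / total.
relative = [f / n for f in frequencies]

for i, (f, rf) in enumerate(zip(frequencies, relative), start=1):
    print(f"class {i}: {f}/{n} = {rf:.3f}")

total = sum(relative)              # should be 1, up to rounding error
print(f"sum of relative frequencies: {total:.3f}")
```

Because every relative frequency is a fraction of the same total, the values sum to 1 and no bar can rise above 1.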
Uses of histograms
- Visual sense of central tendency (where data are centered) and spread (how data are dispersed).
- Histograms can help identify outliers (values far from the main cluster).
- Normal distribution intuition: a bell-shaped, symmetric histogram with data concentrated around the center; if you connect the tops of the bars with a smooth curve, it resembles a bell curve.
- Skewness intuition:
- Right-skewed (positive skew): tail extends to the right; mean > median.
- Left-skewed (negative skew): tail extends to the left; mean < median.
- Symmetric distribution: mean ≈ median.
- Mnemonic for skewness: if skewed left, the histogram might resemble the toes of the left foot; if skewed right, the toes of the right foot.
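The mean-versus-median rule of thumb can be tried on small data sets. The sketch below is a heuristic, not a formal skewness measure: the function `skew_direction`, its `tolerance` cutoff, and the three data lists are all assumptions made for this example.

```python
from statistics import mean, median

def skew_direction(data, tolerance=0.05):
    """Guess skew from mean vs. median (rule of thumb, not a formal test).

    `tolerance` is an arbitrary cutoff: differences smaller than this
    fraction of the data range are treated as "roughly symmetric".
    """
    m, med = mean(data), median(data)
    spread = max(data) - min(data)
    if spread == 0 or abs(m - med) <= tolerance * spread:
        return "roughly symmetric"
    return "right-skewed" if m > med else "left-skewed"

# Hypothetical data sets with a long right tail, a long left tail, and neither.
right_tail = [1, 2, 2, 3, 3, 3, 4, 4, 20]
left_tail = [-20, -4, -4, -3, -3, -3, -2, -2, -1]
symmetric = [1, 2, 3, 4, 5, 6, 7]

for name, data in [("right_tail", right_tail),
                   ("left_tail", left_tail),
                   ("symmetric", symmetric)]:
    print(name, mean(data), median(data), skew_direction(data))
```

The single large value 20 pulls the mean of `right_tail` above its median, which is the pattern described for right-skewed data.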
Normal quantile plots (Q-Q plots)
- Purpose: provide additional information about normality beyond histograms.
- Interpretation (as described in the transcript):
- Each data value is plotted as a point with the y-axis representing a standardized score (a future topic: standardization, z-scores).
- If the plotted points lie close to a straight line with no obvious pattern, the data may be from a normal distribution.
- Systematic patterns or large deviations from a straight line suggest departures from normality.
- The transcript notes that reading and constructing these plots is not part of the current course; interpreting them is the focus.
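Constructing these plots is outside the course, but a sketch of where the points come from can help with reading them. The snippet pairs each sorted data value with a standard normal z-score. The plotting position (i − 0.5)/n is one common convention and is an assumption here, as are the function name and the sample data; software packages may use slightly different positions.

```python
from statistics import NormalDist

def normal_quantile_points(data):
    """Return (data value, z-score) pairs for a normal quantile plot."""
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting position (i - 0.5) / n: one common convention, assumed here.
    return [(x, std_normal.inv_cdf((i - 0.5) / n))
            for i, x in enumerate(ordered, start=1)]

sample = [4.1, 5.0, 5.6, 6.2, 7.3]  # hypothetical measurements
points = normal_quantile_points(sample)
for x, z in points:
    print(f"x = {x:4.1f}   z = {z:+.3f}")
```

If the data came from a normal distribution, these (x, z) pairs would fall close to a straight line when plotted.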
Section 2.4 overview: scatter plots, correlation, and regression
- Purpose: study relationships between pairs of variables.
- Example scenario: shoe print length (x) and suspect height (y). Plot pairs as scatter plot points to assess any relationship.
- Scatter plot basics
- Each point represents one paired observation: (xi, yi).
- If the points tend to lie near a straight line, there may be a linear relationship; otherwise, the relationship may be non-linear.
- Positive linear relationship: as x increases, y tends to increase; slope positive.
- Negative linear relationship: as x increases, y tends to decrease; slope negative.
- Non-linear relationships may show curved patterns (e.g., quadratic or other shapes) where the relationship exists but is not linear.
- Notation: in a scatter plot, one variable is plotted on the x-axis and the other on the y-axis; e.g., x = shoe length, y = height.
- Examples discussed in the transcript:
- Overhead width vs weight: a relationship is visible (evidence of some correlation).
- Height of a president vs height of the opponent: unclear pattern (not a strong linear relationship in the example).
- Other plots show positive linear patterns, curved non-linear patterns, and mixed patterns. Only the leftmost plot shows a clear linear relationship.
- Correlation vs causation
- Correlation measures the strength and direction of a linear relationship, but it does not imply causation. Two variables may be related due to a third factor or coincidence.
- Fun, non-scientific example: chocolate consumption and Nobel Prize counts across countries. Wealthier countries may produce more Nobel Prizes and also spend more on chocolate, so wealth confounds the relationship.
- Measuring correlation: the linear correlation coefficient, r
- Symbol: r (a number between -1 and 1).
- r measures the strength of a linear relationship between x and y.
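- Formula (standard form): \( r = \dfrac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}} \), where n is the number of (x, y) pairs.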
- The sign of r indicates the slope sign of the best-fit line if the relationship is linear: positive r -> positive slope; negative r -> negative slope.
- Magnitude of r indicates how closely the data cluster around the straight line: |r| near 1 means strong linear relationship; |r| near 0 means weak linear relationship.
- Special values and interpretations:
- r = 1: all points lie on a straight line with positive slope.
- r = -1: all points lie on a straight line with negative slope.
- r = 0.3 or r = 0.5: positive linear trend but not perfect clustering around a line.
- r = 0: no apparent linear relationship; could still have a non-linear relationship.
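The special values above can be reproduced with a small Python sketch of Pearson's r, written in its usual computational form (sums of x, y, xy, x², y²). The function name `linear_correlation` and all the data points are illustrative assumptions, not taken from the transcript.

```python
from math import sqrt

def linear_correlation(xs, ys):
    """Pearson's r: (n*Sxy - Sx*Sy) / (sqrt(n*Sxx - Sx^2) * sqrt(n*Syy - Sy^2))."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    denom = sqrt(n * sxx - sx ** 2) * sqrt(n * syy - sy ** 2)
    if denom == 0:
        raise ValueError("r is undefined when x or y is constant")
    return (n * sxy - sx * sy) / denom

xs = [1, 2, 3, 4, 5]
r_up = linear_correlation(xs, [2 * x + 1 for x in xs])      # exact line, positive slope
r_down = linear_correlation(xs, [10 - 3 * x for x in xs])   # exact line, negative slope
r_curve = linear_correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # y = x^2

print(r_up, r_down, r_curve)
```

The last case is the one to remember: y = x² is a perfect relationship, yet r comes out as 0 because the relationship is not linear.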
- Using r in a statistical test of linear relationship
- To assess significance, compare the observed r to critical values from a table that depends on the sample size n.
- Example: a scatter plot with 5 data points yields critical values of ±0.878. This means:
- If |r| > 0.878, there is evidence of a linear relationship at the chosen significance level (as per the table’s standard test).
- If |r| ≤ 0.878, the evidence for a linear relationship is not strong at that level.
- In practice, you would select the appropriate critical value from a table based on your sample size and desired alpha level, using the homework tools to access the table.
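The decision rule can be written as a one-line comparison. In this sketch the critical value 0.878 for n = 5 is the one quoted in the notes. The function name and the two sample r values are hypothetical, and any other critical value would have to be looked up in the table.

```python
def has_significant_linear_correlation(r, critical_value):
    """Return True when |r| exceeds the table's critical value."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    return abs(r) > critical_value

CRITICAL_R_N5 = 0.878  # critical value for n = 5, as quoted in the notes

# Hypothetical r values from two different five-point scatter plots.
strong = has_significant_linear_correlation(0.95, CRITICAL_R_N5)
weak = has_significant_linear_correlation(-0.60, CRITICAL_R_N5)
print(strong, weak)
```

Taking the absolute value means a strongly negative r counts as evidence of a linear relationship just as a strongly positive one does.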
Quick recap of key ideas to remember
- Histogram vs bar graph: histograms for quantitative data with no gaps; bar graphs for qualitative data with gaps between bars.
- Class width must be kept constant across all classes; boundary decisions must be uniform.
- Relative frequency histograms show proportions; their y-axis tops out at 1 (100%).
- Central tendency and spread are primary focuses of histogram interpretation; histograms can reveal outliers and skewness.
- Normal distributions are bell-shaped and symmetric; skewness describes deviations from symmetry (left or right).
- Normal quantile plots help assess normality by checking alignment to a straight line.
- Scatter plots visualize relationships between two variables; correlation coefficient r quantifies linear association; remember: correlation ≠ causation.
- The sign and magnitude of r tell you the direction and strength of a linear relationship; significance tests compare r to critical values that depend on n.
Homework and lab guidance from the transcript
- The homework problems emphasized interpreting histograms or selecting the correct histogram from multiple choices rather than constructing a histogram from scratch.
- This emphasis shows that knowing how to read and interpret distributions and scatter plots is essential for the week’s assessment.