Histograms, Distributions, and Scatter Plots (Notes)

  • Data basics

    • When you start with data, a histogram is a way to present a large collection of numbers so you can see patterns.
    • A histogram is built from a frequency distribution: data are broken into classes (bins) of equal width, and the count (frequency) of data values in each class is shown as the height of a bar.
  • Frequency distributions and class width

    • Classes have equal width. Example: in a histogram with a class width of five years, you might see class intervals like [60, 65), [65, 70), etc. If data are integers, these become 60–64, 65–69, etc.
    • Boundary handling is up to you, but it must be uniform across all classes. A common convention is to assign boundary values to the left class: the first class contains 60 ≤ x < 65, the second 65 ≤ x < 70, and so on. If data are integers, this corresponds to 60–64 for the first class, 65–69 for the second, etc.
    • The class width is the same for every class and equals the difference between consecutive left-hand limits: e.g. with left-hand limits 75, 125, 175, 225, 275 the width is 125−75 = 50, 175−125 = 50, etc. So each class spans 50 units.
    • Example (class width and left limits):
    • Left-hand limits: 75, 125, 175, 225, 275
    • Class width: 50 = 125 − 75 = 175 − 125 = 225 − 175 = 275 − 225
    • Class intervals (left-inclusive, right-exclusive): [75, 125), [125, 175), [175, 225), [225, 275), [275, 325)
    • If data are integers, the discrete equivalents are [75, 124], [125, 174], [175, 224], [225, 274], [275, 324].
    • Frequencies in each class are the counts of data values that fall into the class.
    • Example frequencies (illustrative): first class 11, second class 24, third class 10, fourth class 3, fifth class (not specified in the transcript). The maximum frequency mentioned is 24.
    • The bar for a histogram is drawn for each class, adjacent to each other with no gaps between bars (unlike a bar graph).
    • Gaps between bars occur only if a class has zero data values (an empty class).
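The binning rules above can be sketched in a few lines of Python. This is a minimal illustration, not the course's method: the function names `class_intervals` and `frequency_distribution` are made up for this example, and the `sample` values are hypothetical (the transcript does not list the raw data). It uses the left-inclusive, right-exclusive convention described in the notes.

```python
def class_intervals(left_limits):
    """Return (width, intervals) for left-inclusive, right-exclusive classes."""
    # The width is the difference between consecutive left-hand limits;
    # all of those differences must agree.
    widths = {b - a for a, b in zip(left_limits, left_limits[1:])}
    if len(widths) != 1:
        raise ValueError("class widths must be equal")
    width = widths.pop()
    return width, [(low, low + width) for low in left_limits]


def frequency_distribution(values, left_limits):
    """Count how many values fall in each class [low, high)."""
    width, intervals = class_intervals(left_limits)
    counts = [0] * len(intervals)
    for v in values:
        index = int((v - left_limits[0]) // width)
        if 0 <= index < len(counts):
            counts[index] += 1
        # Values outside every class are ignored in this sketch.
    return intervals, counts


left_limits = [75, 125, 175, 225, 275]               # from the notes
sample = [75, 124, 125, 130, 174.9, 175, 260, 324]   # hypothetical values

width, intervals = class_intervals(left_limits)
_, counts = frequency_distribution(sample, left_limits)
print(width)      # 50
print(intervals)  # [(75, 125), (125, 175), (175, 225), (225, 275), (275, 325)]
print(counts)     # [2, 3, 1, 1, 1]
```

Note that 125 and 175 land in the class that starts at them, which is the "boundary value goes to the class on its left edge" convention applied uniformly.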
  • Bar graphs vs histograms

    • Bar graphs are used for qualitative (categorical) data, where the bars may have gaps between them; each bar represents a category (e.g., languages like Spanish, Chinese).
    • Histograms are used for quantitative (numerical) data and have bars with no gaps, representing contiguous numerical intervals.
  • Example frequency distribution (time in seconds)

    • First (lowest) class: 75–124 (width 50) with frequency 11
    • Second class: 125–174 with frequency 24
    • Third class: 175–224 with frequency 10
    • Fourth class: 225–274 with frequency 3
    • Fifth class: 275–324 (width 50) with some frequency (not explicitly given in the transcript)
    • Note on width calculation: class left limits are 75, 125, 175, 225, 275; width = 50.
    • Discussion question: what is the width of each class? Answer: 50.
    • If you were to draw the histogram, you would place bars with heights corresponding to these frequencies, with no gaps between bars if data exist in consecutive classes.
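As a rough picture of what that histogram looks like, here is a small text rendering of the frequencies listed above. It is only a sketch: the helper name `text_histogram` is invented for this example, the bars run horizontally rather than vertically, and the fifth class is left out because the transcript does not give its frequency.

```python
# Frequencies given in the notes; the fifth class (275-324) is omitted
# because its frequency is not stated in the transcript.
classes = ["75-124", "125-174", "175-224", "225-274"]
frequencies = [11, 24, 10, 3]


def text_histogram(labels, counts):
    """Return one line per class with a bar of '#' as long as the frequency."""
    label_width = max(len(label) for label in labels)
    return [
        f"{label:>{label_width}} | {'#' * count} {count}"
        for label, count in zip(labels, counts)
    ]


rows = text_histogram(classes, frequencies)
print("\n".join(rows))
```

The rows sit directly next to each other, mirroring the "no gaps" rule for adjacent classes; the longest bar belongs to the 125–174 class with frequency 24.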
  • Relative frequency histogram

    • Relative frequency = proportion of the total data in a class, i.e., \( \text{relative frequency} = \frac{f_i}{n} \), where \( f_i \) is the class frequency and \( n \) is the total number of data values.
    • Example data (prices): class widths are 10 units (e.g., $1–$10, $11–$20, …).
    • Frequencies for the example: first class 20, second 21, third 13, fourth 8, fifth 4. Total data points: \( n = 66 \).
    • Relative frequencies:
    • First class: \( \frac{20}{66} \approx 0.303 \)
    • Second class: \( \frac{21}{66} \approx 0.318 \)
    • Third class: \( \frac{13}{66} \approx 0.197 \)
    • Fourth class: \( \frac{8}{66} \approx 0.121 \)
    • Fifth class: \( \frac{4}{66} \approx 0.061 \)
    • In a relative frequency histogram, the bars have the same shape as the frequency histogram, but the y-axis shows proportions (0 to 1). The y-axis should not exceed 1 since it represents a fraction of the total data.
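The relative frequencies in the price example can be checked with a short calculation. This sketch uses only the class frequencies stated in the notes; the variable names are arbitrary.

```python
frequencies = [20, 21, 13, 8, 4]   # class frequencies from the price example
n = sum(frequencies)               # total number of data values

# Relative frequency of each class: f_i / n
relative = [f / n for f in frequencies]
rounded = [round(p, 3) for p in relative]

print(n)        # 66
print(rounded)  # [0.303, 0.318, 0.197, 0.121, 0.061]
```

Before rounding, the relative frequencies add up to 1, which is why no bar in a relative frequency histogram can rise above 1.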
  • Uses of histograms

    • Visual sense of central tendency (where data are centered) and spread (how data are dispersed).
    • Histograms can help identify outliers (values far from the main cluster).
    • Normal distribution intuition: a bell-shaped, symmetric histogram with data concentrated around the center; if you connect the tops of the bars with a smooth curve, it resembles a bell curve.
    • Skewness intuition:
    • Right-skewed (positive skew): tail extends to the right; the mean is typically greater than the median (mean > median).
    • Left-skewed (negative skew): tail extends to the left; the mean is typically less than the median (mean < median).
    • Symmetric distribution: mean ≈ median.
    • Mnemonic for skewness: if skewed left, the histogram might resemble the toes of the left foot; if skewed right, the toes of the right foot.
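The mean-versus-median pattern for skewed data can be seen on small made-up data sets. The three lists below are hypothetical and chosen only to show the effect of a long tail; the comparison is a rule of thumb rather than a guarantee for every data set.

```python
from statistics import mean, median

# Hypothetical data: one large value creates a long right tail.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]
# Mirror image of the same data: the long tail is now on the left.
left_skewed = [21 - x for x in right_skewed]
# Evenly balanced around the center.
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right-skewed", right_skewed),
                   ("left-skewed", left_skewed),
                   ("symmetric", symmetric)]:
    print(f"{name:>12}: mean = {mean(data):.1f}, median = {median(data):.1f}")
```

The tail pulls the mean toward it while the median stays near the bulk of the data, so the mean ends up above the median for the right-skewed list and below it for the left-skewed list.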
  • Normal quantile plots (Q-Q plots)

    • Purpose: provide additional information about normality beyond histograms.
    • Interpretation (as described in the transcript):
    • Each data value is plotted as a point with the y-axis representing a standardized score (a future topic: standardization, z-scores).
    • If the plotted points lie close to a straight line with no obvious pattern, the data may be from a normal distribution.
    • Systematic patterns or large deviations from a straight line suggest departures from normality.
    • The transcript notes that reading and constructing these plots is not part of the current course; interpreting them is the focus.
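Although constructing these plots is outside the course, a small sketch may make the interpretation less mysterious. It pairs each sorted data value with the z-score a normal distribution would place at that rank. The function name `normal_quantile_points`, the sample values, and the plotting position (i + 0.5)/n are all assumptions made for this illustration; software packages use slightly different conventions.

```python
from statistics import NormalDist


def normal_quantile_points(data):
    """Pair each sorted data value (x) with the z-score expected at its rank (y)."""
    ordered = sorted(data)
    n = len(ordered)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    # (i + 0.5) / n is one common plotting position (an assumption here).
    return [
        (x, standard_normal.inv_cdf((i + 0.5) / n))
        for i, x in enumerate(ordered)
    ]


sample = [62, 65, 66, 68, 70, 71, 75]  # hypothetical data values
points = normal_quantile_points(sample)
for x, z in points:
    print(f"{x:>4}  {z:+.2f}")
```

If the (x, z) pairs fall close to a straight line when plotted, that is consistent with the data coming from a normal distribution; a curve or other systematic pattern suggests otherwise.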
  • Section 2.4 overview: scatter plots, correlation, and regression

    • Purpose: study relationships between pairs of variables.
    • Example scenario: shoe print length (x) and suspect height (y). Plot pairs as scatter plot points to assess any relationship.
    • Scatter plot basics
    • Each point represents one paired observation: (xi, yi).
    • If the points tend to lie near a straight line, there may be a linear relationship; otherwise, the relationship may be non-linear.
    • Positive linear relationship: as x increases, y tends to increase; slope positive.
    • Negative linear relationship: as x increases, y tends to decrease; slope negative.
    • Non-linear relationships may show curved patterns (e.g., quadratic or other shapes) where the relationship exists but is not linear.
    • Notation: in a scatter plot, one variable is plotted on the x-axis and the other on the y-axis; e.g., x = shoe length, y = height.
    • Examples discussed in the transcript:
    • Overhead width vs weight: a relationship is visible (evidence of some correlation).
    • Height of a president vs height of the opponent: unclear pattern (not a strong linear relationship in the example).
    • Other plots show positive linear patterns, curved non-linear patterns, and mixed patterns. Only the leftmost plot shows a clear linear relationship.
    • Correlation vs causation
    • Correlation measures the strength and direction of a linear relationship, but it does not imply causation. Two variables may be related due to a third factor or coincidence.
    • Fun, non-scientific example: chocolate consumption and Nobel Prize counts across countries. Wealthier countries may produce more Nobel Prizes and also spend more on chocolate, so wealth confounds the relationship.
    • Measuring correlation: the linear correlation coefficient, r
    • Symbol: r (a number between −1 and 1, inclusive).
    • r measures the strength of a linear relationship between x and y.
    • Formula (standard form):
      \[ r = \frac{\sum_{i=1}^{n} (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i-\bar{y})^2}} \]
    • The sign of r indicates the slope sign of the best-fit line if the relationship is linear: positive r -> positive slope; negative r -> negative slope.
    • Magnitude of r indicates how closely the data cluster around the straight line: |r| near 1 means strong linear relationship; |r| near 0 means weak linear relationship.
    • Special values and interpretations:
      • r = 1: all points lie on a straight line with positive slope.
      • r = -1: all points lie on a straight line with negative slope.
      • r = 0.3 or r = 0.5: positive linear trend but not perfect clustering around a line.
      • r = 0: no apparent linear relationship; could still have a non-linear relationship.
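The formula for r translates directly into code. The sketch below is a plain implementation of the standard form, with a hypothetical function name `correlation` and made-up data sets chosen to hit the special values listed above. It assumes neither variable is constant (otherwise the denominator is zero).

```python
from math import sqrt


def correlation(xs, ys):
    """Linear correlation coefficient r for paired data (xs[i], ys[i])."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    # Numerator: sum of products of deviations from the means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: square roots of the sums of squared deviations.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sqrt(sxx) * sqrt(syy))


# Hypothetical data sets illustrating the special values of r.
r_pos = correlation([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])    # points on a rising line
r_neg = correlation([1, 2, 3, 4, 5], [11, 9, 7, 5, 3])    # points on a falling line
r_curve = correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]) # y = x^2, a perfect curve

print(round(r_pos, 3))    # 1.0
print(round(r_neg, 3))    # -1.0
print(round(r_curve, 3))  # 0.0
```

The last case shows why r = 0 does not mean "no relationship": y is completely determined by x, but the relationship is curved, so the linear correlation is zero.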
    • Using r in a statistical test of linear relationship
    • To assess significance, compare the observed r to critical values from a table that depends on the sample size n.
    • Example: a scatter plot with 5 data points yields critical values of ±0.878. This means:
      • If |r| > 0.878, there is evidence of a linear relationship at the chosen significance level (as per the table’s standard test).
      • If |r| ≤ 0.878, the evidence for a linear relationship is not strong at that level.
    • In practice, you would select the appropriate critical value from a table based on your sample size and desired alpha level, using the homework tools to access the table.
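The decision rule above is a simple comparison, sketched here with a hypothetical helper `has_linear_evidence`. The only critical value used is the one from the notes (0.878 for n = 5); for any other sample size or alpha level the value would have to be looked up in the table, so it is passed in as a parameter rather than computed. The r values tested are made up.

```python
def has_linear_evidence(r, critical_value):
    """Return True when |r| exceeds the table's critical value."""
    return abs(r) > critical_value


CRITICAL_N5 = 0.878  # critical value quoted in the notes for n = 5

# Hypothetical r values from samples of 5 points each.
results = {r: has_linear_evidence(r, CRITICAL_N5) for r in (0.95, -0.90, 0.50, 0.878)}
for r, verdict in results.items():
    print(f"r = {r:+.3f}: {'evidence of' if verdict else 'no strong evidence of'} a linear relationship")
```

Because the test uses |r|, a strongly negative r such as −0.90 counts as evidence just as a strongly positive one does, and a value exactly equal to the critical value does not.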
  • Quick recap of key ideas to remember

    • Histogram vs bar graph: histograms for quantitative data with no gaps; bar graphs for qualitative data with gaps between bars.
    • Class width must be kept constant across all classes; boundary decisions must be uniform.
    • Relative frequency histograms show proportions; their y-axis tops at 1 (100%).
    • Central tendency and spread are primary focuses of histogram interpretation; histograms can reveal outliers and skewness.
    • Normal distributions are bell-shaped and symmetric; skewness describes deviations from symmetry (left or right).
    • Normal quantile plots help assess normality by checking alignment to a straight line.
    • Scatter plots visualize relationships between two variables; correlation coefficient r quantifies linear association; remember: correlation ≠ causation.
    • The sign and magnitude of r tell you the direction and strength of a linear relationship; significance tests compare r to critical values that depend on n.
  • Homework and lab guidance from the transcript

    • The homework problems emphasized interpreting histograms or selecting the correct histogram from multiple choices rather than constructing a histogram from scratch.
    • This emphasis shows that knowing how to read and interpret distributions and scatter plots is essential for the week’s assessment.