Histograms and Frequency Tables – Study Notes (Transcript Review)

Class width and class limits

  • Class width is the difference between lower limits across consecutive classes, not the difference between the lower and upper limits within a single class.
  • In the example discussed, the class width was 8 (despite some confusion in the narration where 9 was mentioned briefly).
  • The first lower class limit is the starting point of the dataset for constructing the classes; the rest of the class lower limits are obtained by adding the class width to the previous lower limit:
    • If the first lower limit is $L_1$ and the class width is 8, then the later lower limits are $L_k = L_1 + (k-1)\times 8$.
  • Once you have the lower limit for a class, the corresponding upper limit for that class (assuming discrete data and inclusive intervals) is
    $U_k = L_k + w - 1$
    where $w$ is the class width.
  • Example construction (assuming first lower limit $L_1 = 54$ and width $w = 8$):
    • Class 1: $54 \le x \le 61$
    • Class 2: $62 \le x \le 69$
    • Class 3: $70 \le x \le 77$
    • and so on. (Note: some parts of the transcript show uncertainty about exact endpoints; the standard convention for integer data is to use inclusive intervals with width 8, which yields the above bounds.)
  • Boundary considerations: the class boundaries can also be described in terms of continuous intervals if you shift to class boundaries (e.g., 53.5 to 61.5, 61.5 to 69.5, etc.) when you prefer to treat data as continuous. The key idea is that the class width remains 8 and the lower limit increments by 8.
  • Important caveat from the transcript: there is some confusion about whether numbers like 52 should be included in the first class. The general approach is to decide the first lower limit and then apply the fixed width consistently; numbers below the first lower limit would fall into a preceding class or would require widening/adjusting the first class if you need to include those data points.
  • Midpoints of classes: the midpoint of a class is useful for certain plots (e.g., when you want a representative value for the class). For a class with lower limit $L_k$ and upper limit $U_k$,
    $m_k = \dfrac{L_k + U_k}{2}$
    Example: for class 1 (54 to 61), $m_1 = \dfrac{54 + 61}{2} = 57.5$
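The construction above can be sketched in Python. This is a minimal illustration, assuming integer data, inclusive limits, and the example values $L_1 = 54$ and $w = 8$; the function name `class_table` is hypothetical and does not come from the transcript.

```python
def class_table(first_lower, width, num_classes):
    """Return (lower, upper, midpoint) for each class.

    Assumes integer data with inclusive class limits, so the upper
    limit is lower + width - 1.
    """
    rows = []
    for k in range(1, num_classes + 1):
        lower = first_lower + (k - 1) * width   # L_k = L_1 + (k-1)w
        upper = lower + width - 1               # U_k = L_k + w - 1
        midpoint = (lower + upper) / 2          # m_k = (L_k + U_k)/2
        rows.append((lower, upper, midpoint))
    return rows


classes = class_table(first_lower=54, width=8, num_classes=3)
for lower, upper, midpoint in classes:
    # Class boundaries shift each limit by 0.5 for the continuous view.
    print(f"{lower}-{upper}  midpoint {midpoint}  "
          f"boundaries {lower - 0.5} to {upper + 0.5}")
```

Note that the difference between consecutive lower limits is 8, while the difference between the limits within one class is 7; only the former is the class width.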

Midpoints and class representation

  • The instructor notes that the midpoint of a class is often used to represent the entire class when communicating data (e.g., for a grouped data bar or when you need a single representative value per class).
  • The midpoint is tied to the class interval and is used in some approximate calculations, but not necessarily needed for a basic histogram or frequency table.
  • In the context of the video, the midpoint is described as a necessary step for completing the frequency distribution and visual representation (such as in histograms or related displays).

Building the frequency table and histogram (steps)

  • Data preparation: you sort or arrange data into classes defined by their lower and upper limits.
  • Step 1: determine class width $w$ and the first lower limit $L_1$.
  • Step 2: compute subsequent lower limits using $L_k = L_1 + (k-1)w$ and upper limits using $U_k = L_k + w - 1$ (for inclusive integer data).
  • Step 3: set up a header for the frequency table with columns such as: Class Interval, Lower Limit, Upper Limit, Frequency ($f$), Cumulative Frequency (CF), Relative Frequency ($r$), Percent, Cumulative Relative Frequency (CRF).
  • Step 4: sort data into the appropriate class by counting how many data points fall into each class (tally or counting).
  • Step 5: compute frequencies for each class. If you have a data point that falls into a class, increment that class’s frequency.
  • Step 6: compute the total number of data points $N$.
  • Step 7: compute relative frequencies
    $r_k = \dfrac{f_k}{N}$
    and, if desired, percentages
    $\%_k = 100 \times r_k = \dfrac{100 f_k}{N}$
  • Step 8: compute cumulative frequencies
    $CF_k = \sum_{i=1}^{k} f_i$
  • Step 9: compute cumulative relative frequencies
    $CRF_k = \dfrac{CF_k}{N}$
  • Step 10: draw the histogram using the class intervals on the x-axis and frequencies $f_k$ (or $CF_k$ for the cumulative plot) on the y-axis. If data are continuous, there should be no gaps between adjacent bars (i.e., the bars should touch) for a histogram of frequencies.
  • Step 11: optionally draw the cumulative frequency graph (a cumulative frequency polygon) by plotting $CF_k$ (or $CRF_k$) against the upper class boundaries. It tends to look smoother than the histogram because each point is a running total of the counts up to that class.
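Steps 1 through 9 can be sketched as follows. The data values are made up for illustration (they are not from the transcript), `frequency_table` is a hypothetical helper name, and the sketch assumes integer data with inclusive class limits.

```python
from itertools import accumulate


def frequency_table(data, first_lower, width, num_classes):
    """Build rows of (lower, upper, f, CF, r, percent, CRF)."""
    freqs = [0] * num_classes
    for x in data:
        k = (x - first_lower) // width      # zero-based class index
        if not 0 <= k < num_classes:
            raise ValueError(f"{x} falls outside the defined classes")
        freqs[k] += 1

    n = sum(freqs)                          # total N
    cum = list(accumulate(freqs))           # running totals CF_k
    rows = []
    for k, (f, cf) in enumerate(zip(freqs, cum)):
        lower = first_lower + k * width
        upper = lower + width - 1
        rows.append((lower, upper, f, cf, f / n, 100 * f / n, cf / n))
    return rows


# Illustrative data only.
data = [54, 58, 61, 62, 65, 66, 69, 70, 75, 77]
table = frequency_table(data, first_lower=54, width=8, num_classes=3)
for lower, upper, f, cf, r, pct, crf in table:
    print(f"{lower}-{upper}  f={f}  CF={cf}  r={r:.2f}  {pct:.0f}%  CRF={crf:.2f}")
```

Raising an error for values outside the classes is one possible design choice; it forces an explicit decision about the first lower limit instead of silently dropping data.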

Cumulative frequency table and graph

  • Cumulative frequency table: add a CF column to the frequency table where each CF entry equals the sum of all frequencies up to that class:
    • If the first class has $f_1$ data points, then $CF_1 = f_1$.
    • If the second class has $f_2$ data points, then $CF_2 = f_1 + f_2$.
    • And so on: $CF_k = \sum_{i=1}^{k} f_i$.
  • Cumulative frequency graph: a plot of $CF_k$ against class boundaries. It never decreases, since it is a running total.
  • Relationship to total N: the final cumulative frequency $CF_K$ equals the total number of observations $N$.
  • In context, the instructor notes that the cumulative frequency plot is not as intuitive for seeing the actual pattern of the data as the standard histogram, but it shows how quickly the data accumulates across the range and highlights differences between adjacent class intervals (e.g., the gap between 5 and 8 in the cumulative plot).
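As a rough sketch of how the points of a cumulative frequency graph could be assembled, the snippet below uses assumed frequencies and class boundaries (not values from the transcript) and checks the two properties noted above: the running total never decreases, and its last value equals $N$.

```python
from itertools import accumulate

# Illustrative class frequencies and upper class boundaries (assumed values).
freqs = [3, 4, 3]
upper_boundaries = [61.5, 69.5, 77.5]
first_lower_boundary = 53.5

cum_freqs = list(accumulate(freqs))  # CF_k = f_1 + ... + f_k

# Points for the cumulative frequency graph: start at zero on the
# lowest boundary, then one point per upper class boundary.
points = [(first_lower_boundary, 0)] + list(zip(upper_boundaries, cum_freqs))

# The running total never decreases, and the last value equals N.
assert all(a <= b for a, b in zip(cum_freqs, cum_freqs[1:]))
assert cum_freqs[-1] == sum(freqs)
print(points)
```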

Relative frequency and interpretation

  • Relative frequency measures how often a class occurs relative to the entire data set:
    $r_k = \dfrac{f_k}{N}$
  • Percent representation per class:
    $\%_k = 100 \times r_k = \dfrac{100 f_k}{N}$
  • The transcript provides an example where a total $N$ is discussed (e.g., $N=40$ or $N=37$ in the sample discussion). If, for instance, $f_1 = 7$ and $N = 37$, then $r_1 = \dfrac{7}{37} \approx 0.189$ (about 18.9%).
  • The cumulative relative frequencies are obtained by dividing the cumulative counts by $N$:
    $CRF_k = \dfrac{CF_k}{N}$
  • Practical interpretation: relative frequencies and percentages tell you how the data are distributed across the classes in a way that is independent of the sample size, making it easier to compare distributions from different data sets.
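The sample-size point can be illustrated with a short sketch. The first calculation reuses $f_1 = 7$ and $N = 37$ from the notes; the two samples that follow are invented for illustration, and `relative_frequencies` is a hypothetical helper name.

```python
def relative_frequencies(freqs):
    """Return r_k = f_k / N for each class frequency."""
    n = sum(freqs)
    return [f / n for f in freqs]


# Worked value from the notes: f_1 = 7 with N = 37.
r1 = 7 / 37
print(f"r_1 = {r1:.3f} ({100 * r1:.1f}%)")

# Two made-up samples of different sizes with the same shape.
small = [2, 5, 3]      # N = 10
large = [20, 50, 30]   # N = 100
print(relative_frequencies(small))
print(relative_frequencies(large))
```

Although the raw counts differ by a factor of ten, both samples produce the same relative frequencies, which is what makes them directly comparable.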

Interpreting histograms and distributions

  • A histogram visually communicates where most data points lie, the spread of the data, and the general shape of the distribution.
  • In the transcript, the speaker highlights that a histogram helps you quickly see that most data were clustered on the left side (e.g., values between 1 and 25 in the example) with fewer data on the right side. Because the tail extends to the right, this indicates a right-skewed (positively skewed) distribution.
  • Key concepts related to distribution shape:
    • Mode: the value(s) that occur most frequently; in a histogram, the tallest bar marks the class that shows up most often.
    • Central tendency: mean, median, and mode; the mean is a typical measure but may be influenced by skewness.
    • Skewness: left-skewed (negative) if the tail is on the left; right-skewed (positive) if the tail is on the right.
    • Modality: unimodal if there is one peak; bimodal if there are two distinct peaks, indicating two common ranges of values (two modes).
  • The transcript notes that histograms are often used in business analytics to communicate quickly where data points cluster and to support decisions (e.g., where to allocate resources). Relative to the normal distribution, histograms help assess whether data look roughly normal, skewed, or multimodal.
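To see how skewness pulls the mean away from the median, here is a small sketch using Python's standard `statistics` module. The data are invented for illustration, and comparing the mean with the median is only a rule of thumb for the direction of skew, not a formal test.

```python
from statistics import mean, median, multimode

# Made-up data clustered at low values with a long tail to the right.
data = [1, 2, 2, 3, 3, 3, 4, 5, 10, 25]

m = mean(data)
med = median(data)
modes = multimode(data)

print(f"mean={m}, median={med}, mode(s)={modes}")
# A mean well above the median is a common (not infallible) sign of right skew.
if m > med:
    print("Tail likely on the right (right-skewed)")
```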

Practical implications and data integrity considerations

  • Ensure data are grouped correctly into non-overlapping, contiguous classes. Misplacing data points can distort frequencies and mislead interpretations.
  • If a class width is chosen, the lower class limits should be spaced by that width to avoid gaps or overlaps between classes.
  • When describing data visually, avoid misrepresenting the distribution by omitting relevant gaps or mis-sizing bars. The class width and the number of classes influence the histogram's appearance.
  • When communicating results, choose class intervals and representation (histogram vs cumulative) that best illustrate the message (e.g., a histogram for pattern and spread, a cumulative plot for totals up to a point).
  • Ethical and practical implications: clear visuals promote correct interpretation; biased or poorly constructed histograms can mislead stakeholders. Always link the visualization to the underlying data and explicitly state the total sample size $N$ and the class definitions used.
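One way to guard against gaps and overlaps is to check the class limits before counting. The sketch below assumes inclusive integer limits; `check_classes` is a hypothetical helper, not something described in the transcript.

```python
def check_classes(limits):
    """Check inclusive integer (lower, upper) class limits.

    Returns a list of problem descriptions; an empty list means the
    classes are non-overlapping, contiguous, and of equal width.
    """
    problems = []
    widths = {upper - lower + 1 for lower, upper in limits}
    if len(widths) > 1:
        problems.append(f"unequal widths: {sorted(widths)}")
    for (_, prev_upper), (next_lower, _) in zip(limits, limits[1:]):
        if next_lower <= prev_upper:
            problems.append(f"overlap at {next_lower}")
        elif next_lower > prev_upper + 1:
            problems.append(f"gap between {prev_upper} and {next_lower}")
    return problems


good = [(54, 61), (62, 69), (70, 77)]
bad = [(54, 61), (61, 68), (70, 77)]   # 61 appears twice; 69 is missing
print(check_classes(good))
print(check_classes(bad))
```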

Real-world relevance and applications

  • Histograms and frequency tables are foundational for exploratory data analysis in fields like business analytics, economics, psychology, and public health.
  • They enable quick assessment of distributional properties before applying inferential statistics or modeling.
  • They support decision-making by highlighting where most observations lie, potential outliers, and the degree of concentration in certain ranges.
  • The transcript emphasizes the practical use of histograms for communication: a business analyst can say, with visual backing, where the majority of data points lie and how the distribution shifts when comparing scenarios.

Quick recap of formulas to memorize

  • Class width:
    $w = L_{k+1} - L_k$
  • Class k lower and upper limits (discrete data with width $w$):
    $L_k = L_1 + (k-1)w, \quad U_k = L_k + w - 1$
  • Class midpoint:
    $m_k = \dfrac{L_k + U_k}{2}$
  • Class frequency and totals:
    • Frequency: $f_k$
    • Total observations: $N = \sum_k f_k$
  • Relative frequency and percentage:
    $r_k = \dfrac{f_k}{N}, \quad \%_k = 100 \times r_k = \dfrac{100 f_k}{N}$
  • Cumulative frequency:
    $CF_k = \sum_{i=1}^{k} f_i$
  • Cumulative relative frequency:
    $CRF_k = \dfrac{CF_k}{N}$

Quiz context and study tips (from transcript)

  • There is a quiz on Friday covering sections 1.1 through 2.1; it will be administered in about ten minutes; students have access to about 66% of the quiz material right now.
  • You can use homework questions and solutions while preparing, but you cannot use notes for the online quiz.
  • The instructor plans to review or work through parts of the material but does not guarantee a full walkthrough during class time.
  • When practicing, focus on building and interpreting frequency tables, histograms, cumulative frequency tables, and relative frequencies; also practice converting counts to percentages and understanding what the visual representations imply about the data.

Quick tips for building your own frequency table from data (practical checklist)

  • Decide the class width $w$ and the first lower limit $L_1$.
  • Compute subsequent class limits using $L_k = L_1 + (k-1)w$, and upper limits $U_k = L_k + w - 1$ for discrete data.
  • Place data into each class by counting how many observations fall into each interval.
  • Create a header with: Class Interval, Lower Limit, Upper Limit, Frequency ($f$), CF, $r$, % and CRF.
  • Compute the total $N$ and then $r_k$ and $\%_k$ for each class.
  • Compute the cumulative values CF and CRF.
  • Draw the histogram using class intervals on the x-axis and frequencies on the y-axis; consider a separate cumulative plot if needed.
  • For information retention, note how to interpret the shape, skewness, and modes (unimodal vs bimodal).
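For a quick look at the shape without a plotting library, the last drawing step can be approximated with a text histogram turned on its side (classes down the page, bar length equal to frequency). The frequencies below are made up, and `text_histogram` is a hypothetical helper name.

```python
def text_histogram(rows):
    """Return one text line per class; rows are (lower, upper, frequency)."""
    lines = []
    for lower, upper, f in rows:
        lines.append(f"{lower:>3}-{upper:<3} | {'#' * f} ({f})")
    return lines


# Made-up frequencies for the example classes.
rows = [(54, 61, 3), (62, 69, 4), (70, 77, 3)]
for line in text_histogram(rows):
    print(line)
```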