Notes on Graphical Displays in Descriptive Statistics (Histograms, Relative Histograms, Pie/Pareto, and Scatter Plots)
Frequency Histograms
A histogram is a graphical representation of a frequency distribution.
Horizontal axis (x-axis): quantitative classes (must be quantitative, not qualitative/categorical).
Vertical axis (y-axis): frequencies (how often each class occurs).
Purpose: to show how often the different classes occur, i.e., a picture of the data's distribution.
Consecutive bars must touch; if bars don’t touch, it’s not technically a histogram (it is a bar chart instead).
A “break” symbol can be used on the axis to skip the empty stretch between the origin and the first class.
Two common construction methods:
Using class midpoints (midpoints are plotted as the x-values).
Using class limits (lower and upper limits for each class).
Midpoints and limits discussion in the lecture:
Midpoints: every class has a midpoint m_i, often used in the pictured histogram.
With limits: you can plot using the actual class limits; in software it is often easier to shift each limit by 0.5 units (to the class boundaries) so that the rectangles line up with the axis and adjacent bars touch.
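The two construction methods can be sketched in a few lines of Python. This is only an illustration: the class limits are hypothetical integer-valued classes, not the ones from the lecture, and the helper names `midpoint` and `boundaries` are made up for this sketch.

```python
# Hypothetical class limits (lower, upper) for integer-valued data; not from the lecture.
class_limits = [(10, 19), (20, 29), (30, 39), (40, 49)]

def midpoint(lower, upper):
    """Midpoint of a class: the average of its lower and upper limits."""
    return (lower + upper) / 2

def boundaries(lower, upper, shift=0.5):
    """Class boundaries: limits widened by `shift` so adjacent bars share an edge."""
    return (lower - shift, upper + shift)

midpoints = [midpoint(lo, hi) for lo, hi in class_limits]
edges = [boundaries(lo, hi) for lo, hi in class_limits]

print(midpoints)  # [14.5, 24.5, 34.5, 44.5]
print(edges)      # [(9.5, 19.5), (19.5, 29.5), (29.5, 39.5), (39.5, 49.5)]
```

After the 0.5 shift, the upper boundary of each class equals the lower boundary of the next, which is what lets the bars touch.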
Example visual cues from a plotted histogram:
The tallest bars indicate the most frequent ranges (e.g., the ranges with the highest frequencies).
A dip between bars can indicate fewer observations in adjacent ranges.
Labels are essential: a histogram without axis labels and a title is not properly informative.
Practical note: when you deliver graphics electronically, always include axis labels and a title so the reader knows what is being shown.
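As a rough sketch of what a frequency histogram encodes, the snippet below tallies raw values into classes and prints a text version with a title and both axis labels. The data, the classes, and the function name `tally` are all hypothetical and chosen only for illustration.

```python
# Hypothetical raw data and classes, chosen only for illustration.
data = [12, 15, 21, 22, 22, 27, 31, 33, 38, 45]
class_limits = [(10, 19), (20, 29), (30, 39), (40, 49)]

def tally(values, classes):
    """Count how many values fall inside each (lower, upper) class, limits inclusive."""
    return [sum(lo <= v <= hi for v in values) for lo, hi in classes]

frequencies = tally(data, class_limits)

# Text rendering that carries a title and labels for both axes.
print("Title: Frequency histogram of sample values")
print("x-axis: class (value range) | y-axis: frequency")
for (lo, hi), f in zip(class_limits, frequencies):
    print(f"{lo}-{hi}: {'#' * f} ({f})")
```

Here the tallest bar belongs to the 20-29 class, so that range is the most frequent one in this made-up sample.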
Relative Frequency Histograms
Relative frequency histogram uses the same class structure, but the y-axis shows relative frequencies (proportions) instead of raw counts.
Relative frequency for class i: $p_i = \frac{n_i}{N}$, where $n_i$ is the count in class i and $N$ is the total number of observations.
Shape comparison: the overall shape (where the bulk of the data lies) is the same as in the frequency histogram; only the vertical scale changes, from raw counts to proportions between 0 and 1.
Storytelling choice: relative frequency helps compare distributions across datasets of different sizes or highlight proportional differences.
The lecturer notes that the choice between a frequency histogram and a relative frequency histogram depends on the story you want to tell (e.g., how much one class differs from another, or how one distribution compares with another measured on a different scale).
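The point about comparing datasets of different sizes can be illustrated with a minimal sketch. The two lists of class counts below are hypothetical; the function simply divides each class count by the total number of observations.

```python
# Hypothetical class counts for two datasets of different sizes.
small = [2, 4, 3, 1]      # 10 observations in total
large = [20, 40, 30, 10]  # 100 observations in total

def relative_frequencies(counts):
    """Divide each class count by the total to get proportions that sum to 1."""
    total = sum(counts)
    return [n / total for n in counts]

rel_small = relative_frequencies(small)
rel_large = relative_frequencies(large)

print(rel_small)
print(rel_large)
```

The raw counts differ by a factor of ten, yet the relative frequencies coincide, so the two relative frequency histograms would have the same bar heights.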
Labels and Best Practices
Always include a title and labeled axes on any graphic intended for submission or presentation.
If a graphic is missing labels, it’s not a proper histogram or chart and can be hard to interpret.
When presenting data, consider what story you want to tell and choose the graph type accordingly (histogram vs relative histogram vs other charts).
Pie Charts
A pie chart represents qualitative (categorical) data, showing how many observations fall into each category.
It is not a histogram: histograms are for quantitative classes, while pie charts summarize categories (e.g., grades A, B, C).
The lecturer notes: pie charts can be controversial or less informative; there are criticisms (e.g., hard to compare slice sizes precisely, may mislead about proportions).
Connection to degrees: a circle has 360 degrees, so a category with relative frequency p gets a sector of angle 360° × p, and the whole pie (100%) corresponds to 360°. A percentage is not itself an angle; it has to be converted this way before the slice can be drawn.
Practical advice: avoid pie charts when possible; Pareto charts or other bar-based displays often communicate the story more clearly.
If a pie chart is used, include a legend or labels; ensure the angles accurately reflect relative frequencies.
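The percentage-to-angle conversion is easy to get wrong by hand, so here is a small sketch of it. The grade counts are hypothetical, and `sector_angles` is just an illustrative helper name.

```python
# Hypothetical grade counts, used only to illustrate the conversion.
counts = {"A": 5, "B": 10, "C": 5}

def sector_angles(category_counts):
    """Angle in degrees for each category: 360 times its relative frequency."""
    total = sum(category_counts.values())
    return {cat: 360 * n / total for cat, n in category_counts.items()}

angles = sector_angles(counts)
print(angles)  # {'A': 90.0, 'B': 180.0, 'C': 90.0}
```

A quick sanity check on any pie chart is that the sector angles add up to 360°.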
Pareto Charts
Pareto chart is a bar graph with bars ordered from highest frequency to lowest frequency (i.e., descending order).
Purpose: to emphasize the most significant categories, typically for qualitative data.
Distinction from a standard bar chart: order matters — you must present the bars in descending order of frequency.
Example use: ranking categories by how often they occur, to identify the most impactful categories first.
In the lecture, a Pareto chart is presented as a preferred alternative when dealing with qualitative data and wanting to show the top contributors first.
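The only step that separates a Pareto chart from an ordinary bar chart is the sort. A minimal sketch, using hypothetical complaint categories and counts that do not come from the lecture:

```python
# Hypothetical category counts; the categories are invented for this example.
counts = {"Late delivery": 12, "Damaged item": 30, "Wrong item": 7, "Billing error": 21}

def pareto_order(category_counts):
    """Return (category, count) pairs sorted from highest to lowest frequency."""
    return sorted(category_counts.items(), key=lambda pair: pair[1], reverse=True)

ordered = pareto_order(counts)

# Text rendering: the top contributors appear first.
for category, n in ordered:
    print(f"{category:<14} {'#' * n} ({n})")
```

Because Python's sort is stable, categories with equal counts keep their original relative order.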
Paired Datasets and Scatter Plots
When you have two quantitative variables and you want to compare them, a scatter plot is appropriate.
A scatter plot shows pairs of data points (x, y) that come from the same subject or unit (e.g., same person, same plant, same plot).
Important data integrity rule: the x and y values must come from the same unit (e.g., growth measurements for the same child or the same plot measured at two times).
What a scatter plot communicates:
The relationship between two quantitative variables (whether there is a trend, and what form it takes).
Observations about possible groups or clusters, or potential outliers.
Axes: each axis has a label describing the variable it represents.
Shape observations:
A linear pattern means the data roughly fall along a straight line; this indicates a linear relationship with a positive or negative slope.
Positive correlation: as x increases, y tends to increase.
Negative correlation: as x increases, y tends to decrease.
No correlation: no discernible linear pattern; points are scattered without a linear trend.
Note: correlation detects linear relationships; a non-linear pattern (e.g., quadratic) may have no linear correlation even if a clear relationship exists.
3D scatter plots exist (x, y, z) but are not covered in this course; they can visualize three variables at once.
Interpreting scatter plots:
Look for clusters, outliers, and overall direction (positive/negative/no correlation).
Correlation does not imply causation; it only describes association.
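To make the linear versus non-linear distinction concrete, the sketch below computes the Pearson correlation coefficient for three small paired datasets. The notes do not define a numerical correlation measure, so Pearson's r is an assumption brought in here as the usual measure of linear association; the data are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired values (same unit for each pair)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical paired data: each y is listed in the same order as its x.
xs = [-2, -1, 0, 1, 2]
linear_up = [2 * x + 1 for x in xs]   # y rises with x
linear_down = [-3 * x for x in xs]    # y falls as x rises
quadratic = [x ** 2 for x in xs]      # clear relationship, but not linear

r_up = pearson_r(xs, linear_up)
r_down = pearson_r(xs, linear_down)
r_quad = pearson_r(xs, quadratic)
print(r_up, r_down, r_quad)
```

The quadratic data follow an exact rule, yet their linear correlation is zero, which is why a scatter plot should be inspected before relying on a correlation value.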
Choosing the Graphic for a Dataset (Practical Exercise from the Lecture)
The lecturer presents four datasets and asks which graphic to use (histogram, pie chart, Pareto chart, or scatter plot). Here is the inferred guidance based on the discussion:
Dataset 1: “colleges that high school seniors plan to attend” (qualitative categories: likely a yes/no or category-based choice). Recommendation: Pie chart or Pareto chart (both suitable for qualitative data); not a histogram because there are no numerical class intervals.
Dataset 2: “the number of points scored by each team in every NBA game for a season” (a single quantitative variable with many observations). Recommendation: Histogram (or relative frequency histogram); Pareto would be less natural because it would treat every game as a separate category rather than focusing on a numeric distribution.
Note from the lecturer: If you chose Pareto for Dataset 2, you would end up treating each game as a unique category, resulting in a very large number of bars (e.g., 82 × number of teams), which isn’t typically informative.
The general takeaway: choose the graph that best communicates the underlying data type and the intended story, while keeping in mind the strengths and limitations of each chart type.
Quick Formulas and Concepts to Remember
Midpoint of a class interval: $m_i = \frac{L_i + U_i}{2}$, where $L_i$ and $U_i$ are the lower and upper class limits.
Class width (for equally spaced classes): the difference between the lower limits of consecutive classes, $w = L_{i+1} - L_i$ (adjustments by 0.5 may be used for computational or software alignment when using limits).
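Relative frequency: $p_i = \frac{n_i}{N}$, where $n_i$ is the count in class i and $N$ is the total number of observations.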
Pie chart sector angle for category i (in degrees): $\theta_i = 360^\circ \times p_i = 360^\circ \times \frac{n_i}{N}$
Scatter plot correlation intuition: positive, negative, or no linear correlation; correlation measures linear association and should not be confused with non-linear relationships.
Basic scatter-plot interpretation: a linear pattern approximates a straight line; non-linear patterns may exist without linear correlation.
Practical Takeaways for Exam and Projects
Always ensure axes are labeled and the chart has a descriptive title.
For quantitative data with a single variable, use histograms (or relative frequency histograms) to show distribution; use Pareto charts for qualitative data ordered by frequency.
For qualitative data where you want to emphasize the most common categories, Pareto charts are often more informative than simple bar charts when ordering matters.
For comparing two quantitative variables, use scatter plots with properly labeled axes and, if relevant, consider the potential for linear vs non-linear relationships.
When presenting data, be mindful of the ethical implications of visualization choices (e.g., how the choice of chart, axis scaling, or labeling can influence interpretation or storytelling). Data visualization is a storytelling tool as much as an analytic tool.