Statistics 9/3/25

Quantitative Data: Discrete vs Continuous

There are two types of quantitative data: discrete and continuous. Discrete data are counted values that, for practical purposes, can be treated much like qualitative data: you summarize them by counting how often each distinct value occurs. The discussion here focuses on continuous data, which cannot be meaningfully summarized by listing individual values.

When data are continuous, values can take on an infinite number of possibilities within a range. In practice, this means a list of every value with its frequency is not a useful summary, because nearly every value would appear with a frequency of one. Instead, you group the data into intervals (classes) so that each observation belongs to exactly one class. The grouping turns a long list of numbers into a manageable distribution across a fixed number of bins.

Grouping Continuous Data: Conditions and Strategy

To group continuous data, you must form non‑overlapping groups of equal width that together cover all data values. Three core ideas are emphasized:

All groups must have the same size (width).
The groups must not overlap.
All the data must fit into the groups (no data value is left out).
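
As a minimal sketch, the three conditions above can be checked mechanically. The function name check_classes, the classes, and the sample values are hypothetical and chosen only for illustration; each class is read as lower <= x < upper.

```python
def check_classes(classes, data, tol=1e-9):
    """Check the three grouping conditions for (lower, upper) classes.

    Each class is read as lower <= x < upper. Returns a tuple
    (equal_width, no_overlap, covers_all).
    """
    widths = [upper - lower for lower, upper in classes]
    equal_width = all(abs(w - widths[0]) <= tol for w in widths)

    # After sorting, no class may start before the previous one ends.
    ordered = sorted(classes)
    no_overlap = all(
        ordered[i][1] <= ordered[i + 1][0] + tol
        for i in range(len(ordered) - 1)
    )

    # Every data value must land in exactly one class.
    covers_all = all(
        sum(lower <= x < upper for lower, upper in classes) == 1
        for x in data
    )
    return equal_width, no_overlap, covers_all


# Hypothetical classes and values, for illustration only.
good = check_classes([(0, 10), (10, 20), (20, 30)], [0, 9.5, 10, 29.9])
bad = check_classes([(0, 10), (8, 18)], [9])  # 9 sits in both classes
print(good)  # (True, True, True)
print(bad)   # (True, False, False)
```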

In practice, the process starts with deciding how many groups to use. Unlike the class width, the number of groups is not calculated from the data; you choose it, often based on experience or on the instructions given. This example uses six groups for illustration, and the instructor notes that a typical guideline is somewhere between five and twenty groups, to balance detail and readability.

Calculating Class Width

Once the number of groups k is chosen, the class width w is calculated from the data range:

w = \frac{\max - \min}{k}

For the data in this example, the smallest value is 23 and the largest value is 63.2, giving a range of 40.2. With six groups, the width is:

w = \frac{63.2 - 23}{6} = \frac{40.2}{6} = 6.7 \quad \Rightarrow \quad w = 7

A key practical rule is to round the class width up to a whole number. In this case, 6.7 is rounded up to 7. The instructor warns that not rounding up is a common mistake on exams.
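
The width calculation can be sketched in a few lines of Python. The function name class_width is hypothetical; the inputs are the minimum, maximum, and number of groups from this example.

```python
import math


def class_width(minimum, maximum, k):
    """Return the class width: the range divided by k, rounded UP."""
    raw = (maximum - minimum) / k  # 40.2 / 6, about 6.7 here
    return math.ceil(raw)          # round up, never down


width = class_width(23, 63.2, 6)
print(width)  # 7
```

Note that math.ceil leaves a raw width that is already a whole number unchanged. In that case the maximum would land on the excluded upper boundary of the last class; the notes do not cover that situation.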

Constructing Class Boundaries and Intervals

Start with the lowest data value as the lower boundary of the first class: 23. Add the class width to define the upper boundary, but note that the upper boundary is not included in the first class (the interval is closed on the left, open on the right). This yields the following six classes:

[23, 30),\ [30, 37),\ [37, 44),\ [44, 51),\ [51, 58),\ [58, 65)

These are interpreted as interval boundaries where the lower bound is included and the upper bound is excluded (left-closed, right-open). The data value 63.2 fits into the last class [58, 65).
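
A short sketch of how these boundaries could be generated and used, assuming the start value 23, width 7, and six classes from this example. The function names class_intervals and find_class are hypothetical.

```python
def class_intervals(start, width, k):
    """Build k left-closed, right-open classes as (lower, upper) pairs."""
    return [(start + i * width, start + (i + 1) * width) for i in range(k)]


def find_class(x, intervals):
    """Return the class containing x (lower <= x < upper), or None."""
    for lower, upper in intervals:
        if lower <= x < upper:
            return (lower, upper)
    return None


intervals = class_intervals(23, 7, 6)
print(intervals)
# [(23, 30), (30, 37), (37, 44), (44, 51), (51, 58), (58, 65)]

print(find_class(63.2, intervals))  # (58, 65)
print(find_class(30, intervals))    # (30, 37): 30 belongs to the second class
```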

Alternative Notation

Some textbooks present class intervals with decimal endpoints, for example 23 to 29.9, then 30 to 36.9, etc. This is another valid way to partition the data, depending on how many decimal places the data are recorded with. The instructor emphasizes that the specific notation of the boundaries is less important than the consistency and the non-overlapping, complete coverage of the data.
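
As an illustration of the decimal-endpoint notation, the sketch below builds labels such as "23 to 29.9" for the same six classes. The function name decimal_labels and the decimals parameter are hypothetical; it assumes the data are recorded to one decimal place.

```python
def decimal_labels(start, width, k, decimals=1):
    """Label each class by its lowest and highest recordable value."""
    step = 10 ** -decimals  # smallest recorded increment, 0.1 by default
    labels = []
    for i in range(k):
        lower = start + i * width
        upper = lower + width - step  # last value before the next class
        labels.append(f"{lower} to {upper:.{decimals}f}")
    return labels


labels = decimal_labels(23, 7, 6)
print(labels[0])   # 23 to 29.9
print(labels[1])   # 30 to 36.9
print(labels[-1])  # 58 to 64.9
```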

From Hand Tallies to Software: Frequency Tables and Graphs

With continuous data, you typically create a frequency table that tallies how many observations fall into each class. The instructor walks through using StatCrunch to perform sorting, creating a frequency table, and generating a histogram.

Sorting the Data

To begin, sort the data in ascending (or descending) order. In StatCrunch, you can access data management options (Data) to save or load datasets. Sorting is done via the Sort option, selecting the relevant column, and choosing ascending order. In this example, the data are already in a column with 136 values; the smallest value is 23 and the largest is 63.2.
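
Outside StatCrunch, the same sorting step can be sketched in Python. The six values below are a hypothetical sample, not the 136-value dataset from the lecture.

```python
# Hypothetical sample values, for illustration only.
values = [41.1, 23.0, 63.2, 35.7, 30.0, 52.4]

ascending = sorted(values)  # returns a new list, leaves `values` unchanged
smallest = ascending[0]
largest = ascending[-1]

print(ascending)          # [23.0, 30.0, 35.7, 41.1, 52.4, 63.2]
print(smallest, largest)  # 23.0 63.2
```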

Creating the Frequency Table (Bin Boundaries)

In StatCrunch, to create a frequency table you use Tables -> Frequency. You select the variable and the statistic (Frequency). The critical step is to specify bin boundaries by checking Bin numerical values and entering the start and width:

Start: 23 (the smallest value)
Width: 7 (the class width determined earlier)

When you compute, you obtain a frequency table with the six classes and their frequencies. In the example, the frequencies are:

23 to 30: 21 observations
30 to 37: 39 observations
37 to 44: 31 observations
44 to 51: 28 observations
51 to 58: 9 observations
58 to 65: 8 observations

These frequencies can be read from the frequency table and correspond to the data distribution across the six classes. The instructor notes the notation used in StatCrunch (intervals with the left endpoint included and the right endpoint excluded) to align with the output.
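
The tallying that StatCrunch performs can be sketched as follows. This is an assumed equivalent, not StatCrunch's actual implementation; the function name frequency_table and the sample values are hypothetical, and the lecture's full dataset is not reproduced here.

```python
def frequency_table(data, start, width, k):
    """Count values per class, with classes read as lower <= x < upper."""
    counts = [0] * k
    for x in data:
        index = int((x - start) // width)  # which class x falls in
        if not 0 <= index < k:
            raise ValueError(f"{x} falls outside the classes")
        counts[index] += 1
    return counts


# Hypothetical sample values, for illustration only.
sample = [23, 29.9, 30, 36.9, 41.1, 63.2]
counts = frequency_table(sample, start=23, width=7, k=6)

for i, count in enumerate(counts):
    lower = 23 + i * 7
    print(f"{lower} to {lower + 7}: {count}")
# 23 to 30: 2
# 30 to 37: 2
# 37 to 44: 1
# 44 to 51: 0
# 51 to 58: 0
# 58 to 65: 1
```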

Graphs for Continuous Data: Histogram vs Bar Graph

For continuous data, the appropriate graph is a histogram, which is a connected bar graph. Unlike a bar graph for qualitative data, the bars in a histogram touch each other to indicate continuity of the data. The histogram displays the distribution of the data across the six classes with the x-axis representing the class boundaries and the y-axis representing frequencies.

Building the Histogram in StatCrunch

To graph a histogram in StatCrunch you go to Graph -> Histogram, select the variable, and specify the bin start and bin width (as with the frequency table). It is important to manually set the start and width in order to reflect the chosen grouping, rather than relying on the software to choose defaults. The display options include whether to show values above bars (optional) and the exact bin boundaries on the axis.
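
For a quick look without graphing software, a rough text histogram can be printed from the frequencies in this example. The function name text_histogram is hypothetical, and this is only a sketch of the idea, not a substitute for the StatCrunch graph.

```python
def text_histogram(start, width, counts):
    """Return one line per class, with one '#' per observation."""
    lines = []
    for i, count in enumerate(counts):
        lower = start + i * width
        upper = lower + width
        lines.append(f"[{lower}, {upper}) | {'#' * count} {count}")
    return lines


# Frequencies from the example's frequency table.
bars = text_histogram(23, 7, [21, 39, 31, 28, 9, 8])
for line in bars:
    print(line)
```

The rows sit directly against each other, which mirrors the touching bars of a histogram; the longest rows appear near the top, with shorter rows trailing toward the higher classes.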

The Effect of Bin Width on the Histogram

The class width (bin width) directly affects the histogram’s appearance. If you change the width, the number of bars changes, which alters how the data shape is perceived. A small width yields many bars and a potentially noisy picture; a large width yields fewer bars and may obscure detail.

Width = 7 (the chosen value) tends to produce a readable shape that reveals the overall pattern while retaining some detail.
Width = 9 yields fewer bars but similar shape; width = 15 produces only a few bars, losing detail.
Width = 5 or 3 produces many more bars, which can make the histogram look chaotic or overly detailed.

The instructor emphasizes that there is a balance: too many classes can make the distribution appear scattered, while too few classes can obscure important features of the distribution.
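
To see how the width drives the number of bars, the sketch below counts the classes needed to cover this example's range for each width mentioned above. The function name num_bars is hypothetical, and it assumes classes start at the minimum and are left-closed, right-open.

```python
def num_bars(minimum, maximum, width):
    """Number of classes of the given width needed to reach the maximum."""
    # The class index of the maximum is floor((max - min) / width);
    # adding 1 converts that index into a count of classes.
    return int((maximum - minimum) // width) + 1


bar_counts = {w: num_bars(23, 63.2, w) for w in (3, 5, 7, 9, 15)}
for w, n in bar_counts.items():
    print(f"width {w:>2}: {n} bars")
# width  3: 14 bars
# width  5: 9 bars
# width  7: 6 bars
# width  9: 5 bars
# width 15: 3 bars
```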

Interpreting Histogram Shapes: Skewness and Centeredness

When interpreting histograms, the shape helps describe the data’s distribution:

Right skew: Data are clustered toward the left with a tail extending to the right. The narrated example shows a distribution that appears to have most observations on the left and a tail to the right, which is identified as right-skew.
Left skew: Data are clustered toward the right with a tail extending to the left.
Normal (or symmetric): Most data are centered around a central value with a symmetric tail on both sides.
Uniform: All classes have approximately the same frequency; the histogram looks fairly flat.

The instructor notes that almost any dataset can be visually forced to resemble one of these four categories by choosing an appropriate number of classes, but the inherent skewness of the data cannot be changed by the choice of classes.

Practical Note on Skewness and Class Choice

In the example, the dataset is described as clustered on the left with a tail on the right, indicating a right-skewed distribution. Adjusting the number of classes (e.g., from 6 to 8) can make this pattern easier to see, but cannot change the underlying skew if the data truly follow that pattern.

Context: The Gini Index as a Real-World Example

As a brief aside from the data-distribution discussion, the lecture touches on the Gini index, a measure of income inequality within a country. The Gini index ranges from 0 to 100, where 0 indicates perfect equality (everyone earns the same) and higher values indicate greater inequality. In the example, the United States has a Gini index of 41.1. The essential points are:

The Gini index is a summary measure of inequality across a population.
The scale runs from 0 (perfect equality) to 100 (maximum inequality).
It provides a real-world context for why organizing and visualizing data matters, since different distributions correspond to different levels of inequality.

Summary of Key Formulas and Concepts

Class width (bin width):

w = \frac{\max - \min}{k}

Example with data min = 23, max = 63.2, k = 6:

w = \frac{63.2 - 23}{6} = \frac{40.2}{6} = 6.7 \Rightarrow w = 7

Class intervals (left-closed, right-open) for six classes starting at 23:

[23, 30),\ [30, 37),\ [37, 44),\ [44, 51),\ [51, 58),\ [58, 65)

Frequency table displays the count of observations in each class. In the example, the counts were 21, 39, 31, 28, 9, 8, corresponding to the six classes above.

Gini index interpretation:

0 \le \text{Gini} \le 100, \quad \text{Gini} = 0 \text{ (perfect equality)}, \quad \text{Gini} = 100 \text{ (max inequality)}.

Practical Insights and Exam Reminders

For continuous data, grouping is necessary because there are too many distinct values to treat each one individually.
The number of classes is decided by the analyst (or by exam instructions) and typically falls within a reasonable range (5–20). The width must be rounded up to the next whole number to ensure all data are covered.
Non-overlapping, equal-width classes are essential, and all data must fit into the defined classes.
When using software like StatCrunch, be explicit about bin start and width to avoid default choices that may obscure the intended grouping.
A histogram is preferred for continuous data, with touching bars to reflect continuity, whereas a bar graph is appropriate for qualitative or discrete data.
Interpreting histograms involves assessing the shape (right-skewed, left-skewed, symmetric, or uniform) and the degree of clustering. The chosen class width influences how the shape appears but does not change the underlying data pattern.
Real-world contexts (like the Gini index) illustrate why data organization and visualization matter for interpreting distributions and making inferences.