Frequency Distribution Graphs and Distribution Shapes

A frequency distribution graph is a visual representation of data that conveys the same essential information as a frequency distribution table: the list of possible scores ( $X$ ) and the frequency ( $f$ ) associated with each of those scores.
All graphs in this format share a common structure using two axes:
- The $X$ axis (horizontal axis): Represents the possible scores or intervals of scores. Values increase from left to right.
- The $Y$ axis (vertical axis): Represents the frequency of those scores. Frequencies increase from bottom to top.
On these graphs, the height of a bar or a dot corresponds to the frequency of the score indicated on the horizontal axis.
It is critical to include all possible scores, even if their frequency is zero. For example, if a quiz score of $6$ is possible but no one achieved it, the graph should still show the score of $6$ with a height of zero.
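As a minimal sketch of the zero-frequency rule, the snippet below builds the ( $X$ , $f$ ) pairs for every possible score, not only the observed ones. The quiz data and the function name `frequency_table` are hypothetical and chosen only for illustration.

```python
from collections import Counter

def frequency_table(scores, possible_scores):
    """Return (X, f) pairs for every possible score, including f = 0."""
    counts = Counter(scores)  # a Counter returns 0 for scores never observed
    return [(x, counts[x]) for x in possible_scores]

# Hypothetical quiz results: scores from 5 to 10 are possible, nobody scored 6.
quiz_scores = [8, 9, 7, 8, 10, 5, 7, 8, 9, 7]
table = frequency_table(quiz_scores, range(5, 11))

# Crude text graph: one '#' per observation; the row for 6 is still listed.
for x, f in table:
    print(f"{x:>2} | {'#' * f}")
```

Iterating over the possible scores rather than the observed ones is what keeps the score of $6$ on the axis with a height of zero.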

The type of graph appropriate for a dataset depends on the variable's scale of measurement. There are four scales: Nominal, Ordinal, Interval, and Ratio (abbreviated NOIR).
Nominal and Ordinal Scales:
- These are typically discrete variables, meaning they consist of separate, indivisible categories with no intermediate values.
- Examples include college majors (nominal), where you are either one major or another, and military rank (ordinal), where you jump from one rank to the next upon promotion.
- Discrete variables do not "bleed" into one another; they jump.
Interval and Ratio Scales:
- These are generally treated as continuous variables, even if the data are reported in whole numbers.
- Examples include temperature in Celsius or Fahrenheit (interval) and height, weight, or distance (ratio).
- Continuous variables have "real limits," which are the boundaries that separate one score from the next on a continuous scale. Each real limit sits halfway between two adjacent scores, so a whole-number score of $6$ spans $5.5$ to $6.5$ .

If the variable measured is on an interval or ratio scale, the two primary options for graphing are the histogram and the polygon.
The Histogram:
- In a histogram, a bar is centered above each score. If the data is grouped, the bar is centered above each interval (e.g., $50-59$ ).
- The Touch Rule: The bars in a histogram must touch each other. Adjacent bars share boundaries.
- The reason for this is to represent the continuous nature of the variable; the width of each bar extends to the real limits of the score or interval.
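The Touch Rule follows directly from the real limits. The sketch below assumes whole-number scores (a measurement unit of 1), so each real limit lies half a unit beyond the stated score; the helper names `real_limits` and `bars_touch` are hypothetical.

```python
def real_limits(low, high=None, unit=1):
    """Lower and upper real limits of a single score or of the interval low..high."""
    if high is None:          # a single score is an interval of one value
        high = low
    half = unit / 2
    return (low - half, high + half)

def bars_touch(intervals, unit=1):
    """True if each bar's upper real limit equals the next bar's lower real limit."""
    edges = [real_limits(lo, hi, unit) for lo, hi in intervals]
    return all(left[1] == right[0] for left, right in zip(edges, edges[1:]))

print(real_limits(6))         # (5.5, 6.5)
print(real_limits(50, 59))    # (49.5, 59.5)
print(bars_touch([(50, 59), (60, 69), (70, 79)]))  # True
```

Because the bar for $50-59$ ends at $59.5$ and the bar for $60-69$ begins at $59.5$ , the two bars share an edge and no gap can appear between them.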
The Polygon:
- A dot is centered above each score at a height representing the frequency.
- These dots are then connected by straight lines.
- The Closing Rule: A polygon must be a closed shape, never a "floating line." To close the figure, an additional line must be drawn at each end to bring the frequency back to zero.
- To properly close a polygon, the researcher must label the $X$ axis one unit higher than the highest observed score and one unit lower than the lowest observed score, then anchor the line to those points at a frequency of zero.
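The Closing Rule can be sketched as a small transformation of the frequency table: add one zero-frequency point below the lowest score and one above the highest. The function name `polygon_points`, the `step` parameter, and the sample data are assumptions made for illustration.

```python
def polygon_points(table, step=1):
    """Return the vertices of a closed frequency polygon.

    table: list of (score, frequency) pairs sorted by score.
    step:  distance between adjacent scores (1 for whole-number scores).
    """
    lowest = table[0][0]
    highest = table[-1][0]
    # Anchor both ends at f = 0 so the figure is closed, not a floating line.
    return [(lowest - step, 0)] + list(table) + [(highest + step, 0)]

# Hypothetical data: scores 1 to 3 with frequencies 2, 4, 3.
points = polygon_points([(1, 2), (2, 4), (3, 3)])
print(points)  # [(0, 0), (1, 2), (2, 4), (3, 3), (4, 0)]
```

Connecting these vertices in order with straight lines produces a shape that starts and ends on the $X$ axis.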

When categories come from a nominal or ordinal scale, the appropriate format is a bar graph (or bar chart).
The Gap Rule: In a bar graph, the bars must not touch. Spaces are placed between the bars to emphasize that the categories are discrete and separate.
Personality type (Type $A$ , $B$ , or $C$ ) is an example of a nominal variable requiring a bar graph. There is no numerical distance or order that suggests these types should run into each other.

A primary goal of data visualization is to present data so that the reader cannot easily miss details that would change the meaning of the findings.
Hash Marks (Break Marks): If the $X$ axis starts at a number other than zero (for example, starting at score $30$ because no children were shorter than $30$ inches), the researcher must use hash marks ( $//$ or a jagged line) at the beginning of the axis to alert the reader that values have been skipped.
Skipping numbers without a hash mark is misleading, as the reader might assume the first labeled point is score $1$ .
Grouped Data Labeling Preference: For grouped frequency distributions, some researchers label the interval boundaries, but it is often simpler to place a tick mark at the center of each bar or under each dot and label it with the interval exactly as it appears in the table (e.g., $30-31$ , $32-33$ ).

When populations are extremely large, it is often impossible to know the exact frequency for any category.
Relative Frequency: Instead of exact numbers on the $Y$ axis, the graph uses relative heights to show ratios. For instance, if a lake has bluegill and bass, a researcher may not know the total count of either species but can show that there are approximately twice as many bluegill as bass by drawing the bluegill bar twice as tall as the bass bar.
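One way to arrive at relative heights, sketched below, is to convert counts from a sample into proportions. The use of a sample catch, the counts themselves, and the function name `relative_frequencies` are all assumptions for illustration; the text only requires that the bar heights reflect the ratio.

```python
def relative_frequencies(counts):
    """Convert raw category counts into proportions that sum to 1."""
    total = sum(counts.values())
    return {category: f / total for category, f in counts.items()}

# Hypothetical sample catch from the lake, not a census of the whole population.
sample = {"bluegill": 40, "bass": 20}
rel = relative_frequencies(sample)

# Bar heights are drawn in proportion to these values.
for species, p in rel.items():
    print(f"{species:>8} | {'#' * round(p * 30)}")
```

Only the ratio between the heights carries meaning here, so the $Y$ axis needs no exact counts.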
Smooth Curves: If the variable is continuous and the exact frequencies are unknown, a smooth curve is used instead of a jagged polygon or histogram.
The smooth curve indicates that the distribution is an estimate based on relative frequencies rather than absolute counts.
A classic example is the Normal Distribution (Bell Curve), such as that seen in IQ scores. The highest frequency is at the average ( $100$ ), and the frequencies drop off predictably as you move toward higher or lower scores.

Symmetrical Distributions: These are characterized by a mirror-image relationship between the left and right sides. The normal distribution is the most common example.
Skewed Distributions: These occur when a distribution is not symmetrical and possesses a "tail" containing outliers (extreme values far from the average).
Positively Skewed (Right-Skewed):
- In this distribution, the tail of outliers points toward the right (the positive end of the $X$ axis).
- While most scores are concentrated on the left, the few extreme high values pull the tail to the right.
- Example: Income and home prices. Most people earn relatively modest amounts, but a few millionaires create a long right-pointing tail.
- Example: A very difficult test where most students score poorly, but a few excel.
Negatively Skewed (Left-Skewed):
- The tail points toward the left (the negative/lower end of the $X$ axis).
- Most scores are concentrated on the right, but a few extremely low values pull the tail to the left.
- Example: An easy test where almost everyone passes, but a few individuals who did not attend class fail.
Naming Convention: Skewness is always named after the direction of the tail (the outliers), not where the majority of the population is located.
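The naming convention can be checked numerically. The sketch below uses the moment-based skewness coefficient, a common summary that is an addition here and not part of these notes: its sign is positive when the tail points right and negative when the tail points left. The function name `skew_direction` and the test scores are hypothetical.

```python
def skew_direction(scores):
    """Classify a distribution by the sign of its moment-based skewness."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in scores) / n   # third central moment
    if m2 == 0:
        return "symmetrical"                        # all scores identical
    g1 = m3 / m2 ** 1.5
    if abs(g1) < 1e-9:
        return "symmetrical"
    return "positively skewed" if g1 > 0 else "negatively skewed"

# Hypothetical exam scores.
hard_test = [20, 25, 25, 30, 30, 30, 35, 40, 95]    # one student excels
easy_test = [95, 90, 90, 85, 85, 85, 80, 75, 20]    # one student fails

print(skew_direction(hard_test))   # positively skewed
print(skew_direction(easy_test))   # negatively skewed
```

In both cases the label follows the single extreme score in the tail, even though the other eight scores sit at the opposite end of the scale.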

Determining sample size ( $n$ ): To find the total number of individuals in a study from a graph, add the frequencies ( $f$ ) of every category.
- Example: If a personality study shows $10$ people for Type $A$ , $5$ for Type $B$ , and $20$ for Type $C$ , then $n = 10 + 5 + 20 = 35$ .
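The same sum, written as a short sketch with the bar heights read off the personality-type graph (the dictionary layout is only one possible way to record them):

```python
# Frequencies read from the bar graph in the example above.
frequencies = {"Type A": 10, "Type B": 5, "Type C": 20}

# The sample size is the sum of the frequencies across all categories.
n = sum(frequencies.values())
print(n)  # 35
```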
Calculating Interval Width ( $i$ ): When using grouped data intervals, the width is calculated as $i = (\text{High} - \text{Low}) + 1$ .
- Example: For an interval of $30-31$ , the width is $(31 - 30) + 1 = 2$ .
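A minimal sketch of the width formula follows, assuming whole-number scores so that the real limits lie half a unit beyond the stated interval. The function name `interval_width` is hypothetical. The second print shows that the formula agrees with the distance between the real limits.

```python
def interval_width(low, high):
    """Width of a grouped interval of whole-number scores: (High - Low) + 1."""
    return (high - low) + 1

print(interval_width(30, 31))   # 2
print(interval_width(50, 59))   # 10

# Same result from the real limits of 30-31: upper 31.5 minus lower 29.5.
print((31 + 0.5) - (30 - 0.5))  # 2.0
```

The $+ 1$ is needed because both endpoints of the interval count as scores, so $30-31$ contains two values, not one.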