Notes on Graphical Descriptions of Data and Frequency Distributions

Graphing Data: Overview

  • Objective: Describe data visually using graphs. The session focuses on graphing data; a future session covers describing data through central tendency and dispersion. (Note: a third method is mentioned as part of the big three, but not detailed here.)

  • Graphing data helps you quickly assess the data without listing every value.

Bar charts vs histograms

  • Bar chart: bars have space between them; bar height/proportions reflect the number of cases in each category (or category frequency).

  • Histogram: bars touch each other; used for continuous data grouped into bins (intervals).

  • Example interpretation: bar charts make it easy to see which categories are most or least common (e.g., blueberries most common, grapes least common).

Bar charts and histograms: practical tips

  • When examining a bar chart, compare bar heights to gauge relative frequencies.

  • Histograms convey the distribution of continuous data across intervals; the lack of gaps between bars reflects the continuous binning.
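
To make the distinction concrete, here is a minimal sketch of the counting behind each chart: category counts for a bar chart, and interval (bin) counts for a histogram. The fruit names, the scores, and the bin width of 10 are all hypothetical values chosen for illustration, and the helper `bin_counts` is not from the session.

```python
from collections import Counter

# Hypothetical categorical data: one fruit name per observation.
fruits = ["blueberry", "apple", "blueberry", "grape", "apple", "blueberry"]

# Bar chart input: one count per separate category.
category_counts = Counter(fruits)


def bin_counts(values, width):
    """Count values per half-open interval [start, start + width)."""
    counts = Counter(int(v // width) * width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical continuous data (e.g., test scores).
scores = [35.0, 45.5, 62.0, 68.5, 70.0, 71.5, 88.0, 100.0]

# Histogram input: one count per adjacent interval of width 10.
histogram_counts = bin_counts(scores, 10)

# Text rendering: categories are labelled by name, bins by their lower edge.
for label, n in category_counts.most_common():
    print(f"{label:>10} | {'#' * n}")
for start, n in histogram_counts.items():
    print(f"{start:>4}-{start + 10:<4} | {'#' * n}")
```

The categories have no inherent order or adjacency, which is why bar charts leave gaps; the bins tile a continuous scale, which is why histogram bars touch.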

Stem-and-leaf plots

  • Purpose: A graphical data display that resembles a bar chart or histogram turned on its side; often found in the literature.

  • Concept: Left column is the stem; right-side digits are leaves.

  • Reading example:

    • The data values can be reconstructed by combining stem and leaf digits (e.g., stem 1 and leaf 9 give 19; stem 2 and leaf 2 give 22; other values include 25, 26, 27, 28, 29).

  • Interpretation cues:

    • The left stem corresponds to the tens (e.g., 1 for 10s, 4 for 40s).

    • The row with the 4s (forties) is the most common range in the example.

    • In the 40s row, there were two leaves ‘8’ (i.e., two 48s), indicating multiple observations equal to 48.

  • Takeaway: Stem-and-leaf plots show both the distribution shape and the exact data values; useful for reading raw data directly from literature.
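
The stem/leaf split described above can be sketched in a few lines. This assumes non-negative whole numbers below 100, so the stem is the tens digit and the leaf is the ones digit. The values 19, 22, 25 through 29, and the two 48s come from the notes; every other value is hypothetical filler added so the plot has several rows.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted list of leaves (ones digits)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(plot)


# 19, 22, 25-29 and both 48s are from the notes; the rest are hypothetical.
data = [19, 22, 25, 26, 27, 28, 29, 31, 34, 41, 43, 45, 47, 48, 48, 49, 52]
plot = stem_and_leaf(data)

for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Each printed row is one stem; the row length plays the role of a bar, while the leaves keep the exact values readable.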

Reading and interpreting frequency tables (SPSS output)

  • SPSS: a widely used statistical software package; outputs are common in literature and coursework.

  • You do not need SPSS to complete the course, but you will encounter outputs like this and should be able to read them.

  • Key columns in the frequency table:

    • Frequency: raw count of observations at each score/value.

    • Percent: proportion of total observations (out of N, including missing data).

    • Valid Percent: proportion of observations among those with valid (non-missing) data, i.e.,
      $\text{ValidPercent}_i = \frac{f_i}{N_{valid}} \times 100\%$.

    • Cumulative Percent: running total of percentages for values up to and including the current value.

  • Important concepts:

    • If there are missing values, Percent and Valid Percent can differ because Percent uses total N and Valid Percent uses N_{valid}.

    • The cumulative percent column shows the percentage of cases scored at or below each value; e.g., a value like 70 with a cumulative percent around 50% indicates about half the students scored 70 or lower.

  • Example interpretation (from the transcript):

    • There is a score of 35 with Frequency = 1 and Percent = 1%.

    • Other scores in the table include 45 (in the lower range) and 100, with 100 having Frequency = 2 (i.e., two students scored 100).

    • Total N = 100 (i.e., 100 students took the test).

    • Valid Percent equals Percent when there are no missing data; otherwise, they can differ.

  • Practical note:

    • This format lets you assess distributions, identify missing data impact, and understand how many students fall into specific score bins.
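
As a rough illustration of how the four columns relate, the sketch below builds a frequency table from a small list of scores. The data are hypothetical (10 students, 2 with missing scores), `None` is used here to mark a missing value, and the function name `frequency_table` is an assumption of this sketch rather than anything SPSS exposes.

```python
from collections import Counter


def frequency_table(values):
    """Return rows of (value, frequency, percent, valid_percent, cumulative_percent).

    None marks a missing observation (an assumption of this sketch).
    """
    n_total = len(values)                       # N, including missing
    valid = [v for v in values if v is not None]
    n_valid = len(valid)                        # N_valid
    counts = Counter(valid)

    rows = []
    running = 0
    for value in sorted(counts):
        freq = counts[value]
        running += freq
        rows.append((
            value,
            freq,
            100 * freq / n_total,     # Percent uses total N
            100 * freq / n_valid,     # Valid Percent uses N_valid
            100 * running / n_valid,  # Cumulative Percent: running valid total
        ))
    return rows


# Hypothetical scores: 10 students, 2 of them missing.
scores = [35, 45, 70, 70, 70, 85, 100, 100, None, None]
table = frequency_table(scores)

print(f"{'Score':>5} {'Freq':>5} {'Pct':>6} {'Valid':>6} {'Cum':>6}")
for value, freq, pct, valid_pct, cum_pct in table:
    print(f"{value:>5} {freq:>5} {pct:>6.1f} {valid_pct:>6.1f} {cum_pct:>6.1f}")
```

With two missing scores, the row for 70 shows Percent = 30.0 but Valid Percent = 37.5, which is exactly the divergence described above.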

Frequency distributions and distribution shapes

  • Frequency distributions summarize continuous data by grouping into intervals or bins and plotting the distribution of frequencies across those bins.

  • They can resemble a smooth curve (e.g., a normal distribution) or more irregular shapes depending on the data.

  • Reading a frequency distribution: you’re looking at the overall shape of the data rather than individual values.

  • Common shapes observed in frequency distributions:

    • Normal distribution (bell-shaped curve).

    • Skewed distributions:

      • Positive skew (skew to the right): most values on the left with a tail extending to the right.

      • Negative skew (skew to the left): most values on the right with a tail extending to the left.

  • Kurtosis (peakedness):

    • Leptokurtic: very peaked distribution.

    • Platykurtic: flatter, broader peak.
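
The shapes above can also be checked numerically. The sketch below uses one common convention, the moment-based skewness and excess kurtosis (kurtosis minus 3, so a normal distribution scores about 0). This convention is an assumption of the example; statistical packages often apply small-sample corrections, so their reported values can differ slightly. All four data sets are hypothetical.

```python
def central_moment(values, k):
    """k-th central moment: the mean of (x - mean)**k."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)


def skewness(values):
    """Moment-based skewness: positive = right tail, negative = left tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3: positive = leptokurtic, negative = platykurtic."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


# Hypothetical data sets chosen to show each shape.
right_skewed = [1, 1, 2, 2, 2, 3, 3, 4, 10]   # long tail to the right
left_skewed = [10 - v for v in right_skewed]  # mirror image: tail to the left
peaked = [0, 5, 5, 5, 5, 5, 5, 5, 10]         # most values piled at the center
flat = [1, 2, 3, 4, 5]                        # values spread evenly

print("skew (right):", round(skewness(right_skewed), 2))
print("skew (left): ", round(skewness(left_skewed), 2))
print("excess kurtosis (peaked):", round(excess_kurtosis(peaked), 2))
print("excess kurtosis (flat):  ", round(excess_kurtosis(flat), 2))
```

The sign is what matters for reading the shape: skewness above 0 means a right tail, below 0 a left tail; excess kurtosis above 0 is leptokurtic, below 0 platykurtic.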

Box plots (box-and-whisker plots)

  • Box plots summarize distribution using quartiles and central tendency.

  • Key components:

    • Median: the middle value of the data (center line inside the box).

    • Box boundaries: Q1 (25th percentile) and Q3 (75th percentile).

    • Whiskers: extend from the box toward the smallest and largest values; outliers are typically shown as individual points beyond the whiskers.

    • Mean: shown as a point in some plots, though the box plot primarily highlights the median.

  • Interpretive notes (as described in the transcript):

    • 25% of scores fall between the median and the 25th percentile (Q1).

    • 75% of scores fall at or below the 75th percentile (Q3); the remaining 25% fall above it.

    • Outliers are indicated as separate points beyond the whiskers.

  • Terminology:

    • Box and whisker plot is another name for box plot.

  • Example intuition:

    • A box plot comparing traffic on different days (e.g., Friday vs Sunday) might show differences in central tendency and spread; an outlier such as 49 cars on a Sunday could indicate a holiday or special event.

  • Practical interpretation:

    • Box plots allow quick visual comparison across groups (e.g., days of the week) and identification of outliers.

Practical example: traffic data visualization

  • Example setup described: comparing Friday vs Sunday traffic.

    • On average, Friday traffic was about 20 cars; Sunday traffic was about 15 cars.

    • One Sunday had a spike (e.g., 49 cars), illustrating an outlier.

  • How to use this visualization:

    • Assess typical vs. atypical days.

    • Consider possible reasons for spikes (holiday, event) and plan further investigation.

  • Conceptual takeaway:

    • Box plots and other graphs facilitate quick, visual comparisons across categories or groups (e.g., days of the week).
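
The numbers a box plot draws can be computed directly, as in the sketch below. The daily car counts are hypothetical, built to match the description (Fridays near 20, Sundays near 15, one Sunday at 49). Flagging points more than 1.5 × IQR beyond the box is a widely used convention and an assumption of this sketch; the session does not state which outlier rule its plot used. Quartiles come from Python's `statistics.quantiles` with the "inclusive" method, and other software may use slightly different quartile definitions.

```python
import statistics


def box_plot_summary(values):
    """Quartiles, whisker ends, and outliers using the 1.5 * IQR convention."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),    # whiskers stop at the last non-outlier
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Hypothetical car counts, one value per observed day.
friday = [19, 19, 20, 20, 20, 21, 21]
sunday = [13, 14, 15, 15, 16, 17, 49]

friday_summary = box_plot_summary(friday)
sunday_summary = box_plot_summary(sunday)

for day, summary in (("Friday", friday_summary), ("Sunday", sunday_summary)):
    print(day, summary)
```

Here the Sunday median stays at 15 even though the 49-car day pulls the Sunday mean close to 20, which is one reason box plots center on the median.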

Connections to broader course concepts and real-world relevance

  • Graphing data is a foundational step in exploratory data analysis (EDA) and precedes formal statistical testing.

  • Reading different graph types prepares you to interpret results from studies, including those reported in SPSS outputs.

  • Understanding distribution shapes (normal, skewness, kurtosis) informs expectations about statistical methods (e.g., assumptions of normality for parametric tests).

  • Awareness of missing data and the distinction between Percent and Valid Percent helps you avoid misinterpretations when data are incomplete.

  • Box plots are commonly used in systematic reviews and literature to compare distributions across studies or groups.

  • Ethical/practical implications:

    • Misinterpreting skewness or ignoring outliers can lead to incorrect conclusions.

    • Missing data can bias Percent-based interpretations; always check Valid Percent and N_{valid}.

Formulas and numerical references (LaTeX)

  • Cumulative percent for a value $x_i$: $\text{CumPercent}(x_i) = \dfrac{\sum_{j \le i} f_j}{N_{valid}} \times 100\%$, where $f_j$ is the frequency of value $x_j$ and $N_{valid}$ is the number of valid (non-missing) observations.

  • Percent of observations at value $x_i$: $\text{Percent}_i = \dfrac{f_i}{N} \times 100\%$, where $N$ is the total number of observations (including any missing data).

  • Valid percent of value $x_i$: $\text{ValidPercent}_i = \dfrac{f_i}{N_{valid}} \times 100\%$.

  • Mean (for reference, not explicitly shown in the box plot discussion but commonly used):
    $\bar{x} = \dfrac{1}{N} \sum_{i=1}^{N} x_i$.

  • Box plot quartiles and IQR (definitions):
    $Q_1 = 25\text{th percentile},\quad Q_2 = \text{median},\quad Q_3 = 75\text{th percentile},\quad \text{IQR} = Q_3 - Q_1$.

  • Skewness intuition (qualitative):

    • Positive skew: tail to the right (more small values, few large outliers).

    • Negative skew: tail to the left (more large values, few small outliers).

  • Kurtosis intuition (qualitative):

    • Leptokurtic: more peaked than normal.

    • Platykurtic: flatter than normal.

Summary of key takeaways

  • Graphical methods (bar charts, histograms, stem-and-leaf plots, frequency tables, frequency distributions, and box plots) are essential tools for describing data visually and reading raw data when needed.

  • SPSS outputs are common in literature; you should be able to read frequencies, percents, valid percents, and cumulative percents, and recognize how missing data affect Percent vs Valid Percent.

  • Understanding distribution shape (normal, skewness, kurtosis) and central tendency measures (median, mean) informs appropriate analysis choices and interpretation.

  • Box plots provide a concise view of central tendency, dispersion, and outliers, and are especially useful for comparing groups (e.g., days, conditions) across studies or time.