Module 2: Descriptive Analysis and Presentation of Single-Variable Data

Overview of Descriptive Statistics

  • Descriptive statistics provide an overview of data through:
    • Summary Graphs
    • Measures of Central Tendency
    • Measures of Dispersion
    • Measures of Position
    • Statistical Concerns
  • Initially, the focus is on analyzing a single variable.

Summary Graphs

  • The type of graph used depends on the variable type:
    • Quantitative (Numerical Value): Stem and leaf diagrams, frequency histograms
    • Qualitative (Attribute): Circle graphs, bar graphs
Qualitative Data (Nominal and Ordinal)
  • Often uses circle graphs or bar graphs to show relative proportions in various categories.
  • Frequency: The number of observations in each category.
  • Relative Frequency: The proportion (often shown as a percentage) of observations in each category, i.e. the frequency divided by the total number of observations.
  • Graphs can show either frequency or relative frequency; the shape of the graph stays the same, and only the scale and labels change.
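The relationship between frequency and relative frequency can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the function name `frequency_table` and the color sample are hypothetical.

```python
from collections import Counter

def frequency_table(observations):
    """Return {category: (frequency, relative_frequency)} for qualitative data."""
    counts = Counter(observations)
    n = len(observations)
    # Relative frequency = frequency / total number of observations.
    return {category: (f, f / n) for category, f in counts.items()}

# Hypothetical sample: favorite color reported by 10 respondents.
colors = ["red", "blue", "blue", "green", "red",
          "blue", "red", "blue", "green", "blue"]
table = frequency_table(colors)
for category, (f, rel) in sorted(table.items()):
    print(f"{category:<6} f={f}  relative={rel:.0%}")
```

Plotting either column gives bars (or slices) with the same relative sizes, which is why the two kinds of graph look alike.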
Circle Graphs (Qualitative Data)
  • Should include an informative title and a legend.
Bar Graphs (Qualitative Data)
  • Similar to circle graphs but use bars to show relative proportions.
    • Can represent frequency or relative frequency.
    • Should have an informative title, axis labels, and spaces between bars to indicate qualitative data.
Circle Graph vs. Bar Graph
  • Both are used to display sample results.
  • The choice depends on which graph shows the information more clearly.

Quantitative Data

  • Quantitative data can be discrete or continuous.
Stem and Leaf Diagram
  • Used for quantitative data.
  • Includes a diagram title identifying the variable and a key explaining stem and leaf components.
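As a rough sketch of how such a diagram is built, the following assumes non-negative integers, with the last digit as the leaf and the remaining digits as the stem. The function name `stem_and_leaf` and the exam scores are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values, title):
    """Build a text stem-and-leaf diagram for non-negative integers.

    The stem is every digit except the last; the leaf is the last digit.
    """
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    lines = [title]
    # Include empty stems so gaps in the data stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}".rstrip())
    # The key explains how to read one stem/leaf pair.
    first_stem = min(stems)
    first_leaf = stems[first_stem][0]
    lines.append(f"Key: {first_stem} | {first_leaf} means {first_stem * 10 + first_leaf}")
    return "\n".join(lines)

# Hypothetical sample of exam scores.
scores = [47, 52, 58, 61, 63, 63, 70, 75, 78, 84]
diagram = stem_and_leaf(scores, "Exam scores")
print(diagram)
```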
Frequency Distributions and Frequency Histograms
  • Examine values ($x$) and their frequencies ($f$).
  • Ungrouped Data: Frequencies are shown directly when there are few categories.
  • Grouped Data: Data is summarized into classes for larger datasets.
  • Classes should be equally spaced and non-overlapping.
  • A common rule of thumb for the number of classes is $\text{\# classes} \approx \sqrt{n}$, where $n$ is the number of observations.
Grouped Frequencies
  • Frequency ($f$) is the number of observations in each class.
  • $\sum f = n$: the class frequencies sum to the total number of observations.
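One way to turn the grouping rules above into code is sketched below. It assumes half-open classes of equal width, with the last class closed so that the maximum is counted; the function name `grouped_frequencies` and the sample values are hypothetical, and other class-boundary conventions are equally valid.

```python
import math

def grouped_frequencies(values):
    """Group values into about sqrt(n) equal-width, non-overlapping classes.

    Returns a list of (lower, upper, frequency) tuples. Each class is the
    half-open interval [lower, upper), except the last, which also includes
    its upper bound so the maximum value is counted.
    """
    n = len(values)
    k = max(1, round(math.sqrt(n)))      # rule of thumb: number of classes ~ sqrt(n)
    low, high = min(values), max(values)
    width = (high - low) / k or 1        # fall back to 1 if all values are equal
    counts = [0] * k
    for v in values:
        index = min(int((v - low) / width), k - 1)  # clamp the maximum into the last class
        counts[index] += 1
    return [(low + i * width, low + (i + 1) * width, counts[i]) for i in range(k)]

# Hypothetical sample of 16 observations, so sqrt(n) suggests 4 classes.
sample = [0, 3, 5, 8, 11, 12, 14, 17, 19, 21, 22, 25, 28, 31, 36, 40]
classes = grouped_frequencies(sample)
for lower, upper, f in classes:
    print(f"{lower:>5.1f} to {upper:>5.1f}: f = {f}")
```

Summing the frequency column recovers the sample size, which is a quick check that no observation was dropped or counted twice.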
Frequency Histogram
  • Should include an informative title, axes labels, and bars without gaps to indicate quantitative data.
Relative Frequency Histogram
  • Shows the same information as a frequency histogram but uses relative frequencies.
  • When calculating relative frequencies, round to two decimal places.
Histogram Shapes
  • Symmetric: One side of the graph is a mirror image of the other.
  • Uniform: Every value occurs with the same frequency.
  • Skewed: One tail is stretched out longer than the other.
    • Skewed Right: Tail is longer on the right side.
    • Skewed Left: Tail is longer on the left side.
  • J-shaped: No tail on the side with the highest frequency.
  • Mode(s): One or more peaks in the data.
    Note: When describing shape, a histogram can have more than one mode (peak) even if the peaks have different heights.
  • Normal: Symmetric and mounded around the mean, sparse at extremes (bell-shaped).
    • All normal curves are symmetric, but not all symmetric curves are normal.
    • Normal curves are unimodal, symmetric, and bell-shaped.
Outliers
  • Values that fall a significant distance away from the rest of the data points.
  • Not always present but should represent something unusual.

Key Concepts

  • Graph type depends on data type:
    • Qualitative: Circle graphs, bar graphs
    • Quantitative: Stem and leaf diagrams, frequency histograms
  • The main goal is to use the graph to describe sample data.

Measures of Central Tendency

  • Provide information about where the middle of your sample data occurs.
    • Sample Mean
    • Sample Median
    • Sample Mode
      Note: These measures are for quantitative data.
Sample Mean
  • The arithmetic mean.
  • Formula: $\bar{x} = \frac{\sum x}{n}$
Sample Median
  • The middle value when data values are ranked.
  • Odd number of observations: the middle value.
  • Even number of observations: the average of the two middle values.
Sample Mode
  • The most frequent observation.
  • Can have more than one mode if multiple values have the same highest frequency.
    Note: Multiple statistical modes occur only when the tied values have exactly equal (highest) frequencies.
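The three definitions above can be written out directly. This is a minimal from-scratch sketch; the function names and the sample data are hypothetical, and Python's standard `statistics` module offers equivalent functions.

```python
from collections import Counter

def sample_mean(data):
    """Arithmetic mean: sum of the values divided by n."""
    return sum(data) / len(data)

def sample_median(data):
    """Middle value of the ranked data (average of the two middle values if n is even)."""
    ranked = sorted(data)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2

def sample_modes(data):
    """All values that share the highest frequency, in ascending order."""
    counts = Counter(data)
    highest = max(counts.values())
    return sorted(value for value, f in counts.items() if f == highest)

odd_sample = [3, 7, 7, 2, 9, 2, 5]   # hypothetical data, n = 7
even_sample = [2, 2, 3, 5, 7, 7]     # hypothetical data, n = 6

print(sample_mean(odd_sample), sample_median(odd_sample), sample_modes(odd_sample))
print(sample_median(even_sample))
```

In `odd_sample`, the values 2 and 7 each occur twice, so both are reported as modes.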

Measures of Center

  • If the data are symmetrically and unimodally distributed, then the sample mean = median = mode.
  • If the data are NOT symmetric:
    • The sample mean is impacted the most, because it is pulled toward a few unusually large or small values.
    • Skewed Right: Mode < Median < Sample Mean
      Data set 1: 1, 2, 2, 3, 4
      Mean = 2.4, Median = 2, Mode = 2
      Data set 1 with an outlier: 1, 2, 2, 3, 4, 20
      Mean = 5.3, Median = 2.5, Mode = 2
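The numbers in this example can be reproduced with Python's standard `statistics` module. The helper name `center_summary` is hypothetical and added only for illustration.

```python
import statistics

def center_summary(data):
    """Return (mean rounded to 1 decimal, median, mode) for a list of numbers."""
    return (
        round(statistics.mean(data), 1),
        statistics.median(data),
        statistics.mode(data),
    )

data_set_1 = [1, 2, 2, 3, 4]
with_outlier = data_set_1 + [20]

print(center_summary(data_set_1))    # (2.4, 2, 2)
print(center_summary(with_outlier))  # (5.3, 2.5, 2)
```

Adding the single value 20 more than doubles the mean, shifts the median by only 0.5, and leaves the mode unchanged.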
Reasons and Impact of Using Different Measures (Mean, Median, Mode)
  1. Outliers: When outliers are suspected in the data, the sample median should be reported, because the sample mean is the measure most affected by a few very large or very small values.
  2. Data utilization: The sample mean uses every value, whereas the sample median uses only the middle value(s) and the sample mode uses only the most frequent value(s).
  • Gas Price Example (Wall Street Journal Article):
    • The article discusses the most common gas price ($3.29) versus the national average ($3.79).
    • The mode is the actual price on display at more gas stations than any other price.
    • The average is skewed by ultra-high prices in California due to refinery shutdowns and higher taxes.
    • The median and mode are unaffected by California's unusually high prices, making them more relevant.
    • The average excluding California was close to GasBuddy's estimate of the mode.

Other examples of central tendency

  • Debate over Smallest Fish:
    • Paedocypris: adults can be as small as 7.9 mm long.
    • Male Deep Sea Anglerfish: just 6.2 mm long.
    • Stout Infantfish: 8.4 mm long and 1.5 mg, the lightest adult vertebrate.
  • Just an Average Guy (Men’s Health magazine):
    • Age: 34.4 years
    • Weight: 175 lbs
    • Height: 5’10”
    • Drinks 3.3 cups of coffee and 1.2 alcoholic drinks a day.

Measures of Dispersion

  • Assess the spread of data values around the center.
    • Sample Range
    • Sample Variance
    • Sample Standard Deviation
Sample Range
  • Range = maximum sample value - minimum sample value
Sample Variance
  • The average squared deviation of the data.
    • Formula: $s^2 = \frac{\sum (x - \bar{x})^2}{n-1}$
Sample Standard Deviation
  • The square root of the sample variance.
    • Formula: $s = \sqrt{s^2}$
    • Measures average variation in the data set.
    • Is always positive (unless all values are the same, in which case $s = 0$).
  • The sample standard deviation is usually reported.
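The three measures of dispersion can be computed directly from their formulas. The sketch below uses hypothetical function names and, as sample data, the five values 1, 2, 2, 3, 4 from the measures-of-center example.

```python
import math

def sample_range(data):
    """Maximum sample value minus minimum sample value."""
    return max(data) - min(data)

def sample_variance(data):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

def sample_std_dev(data):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(data))

data = [1, 2, 2, 3, 4]
print(f"range    = {sample_range(data)}")
print(f"variance = {sample_variance(data):.2f}")
print(f"std dev  = {sample_std_dev(data):.2f}")
```

For these values the squared deviations sum to 5.2, so $s^2 = 5.2 / 4 = 1.3$ and $s \approx 1.14$. Note that the standard deviation is in the same units as the data, while the variance is in squared units, which is one reason the standard deviation is the figure usually reported.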

Data Summary

  • The sample mean estimates the center of the data.
  • The sample standard deviation estimates the spread of the data.
Standard Deviation and Sample Spread Relation

*Why is s larger in the second graph? Both graphs have the same mean, but the data in the second graph are spread across a wider range around it, so the deviations from the mean (and therefore s) are larger.
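A small numeric illustration of the same idea, using two hypothetical samples that share a mean but differ in spread (these are not the data behind the graphs):

```python
import statistics

# Two hypothetical samples with the same mean (50) but different spread.
narrow = [48, 49, 50, 51, 52]
wide = [30, 40, 50, 60, 70]

for label, sample in (("narrow", narrow), ("wide", wide)):
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)   # sample standard deviation (divides by n - 1)
    print(f"{label:<6} mean = {mean}  s = {s:.2f}")
```

Every deviation in `wide` is ten times the matching deviation in `narrow`, so its standard deviation is ten times larger even though the means are identical.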

  • Batting Average Example:
    • Batting average is the ratio of hits to at-bats.
    • The standard deviation of batting averages has decreased over time, even though the batting averages have remained steady, which explains why there are no more 0.400 hitters.

Measures of Position

  • Summarize sample data using measures of center and spread.
    • Box and Whisker Plots
Quartiles or Percentiles
  • Data are ordered from smallest to largest value.
  • Quartiles divide the ordered observations into four groups, each containing 25% of the data.

Box and Whisker Plot

  • Shows center and spread of data.
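A box and whisker plot is typically drawn from five numbers: the minimum, the first quartile, the median, the third quartile, and the maximum. The sketch below computes them with `statistics.quantiles` (Python 3.8+). Quartile conventions differ between textbooks and software, so the `"inclusive"` method used here is an assumption and may not match the course's hand-calculation rule; the function name and data are hypothetical.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers.

    Quartiles use the 'inclusive' interpolation method; other conventions
    can give slightly different Q1 and Q3 values.
    """
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (min(data), q1, q2, q3, max(data))

# Hypothetical sample of 9 observations.
values = [2, 4, 4, 5, 7, 8, 9, 11, 12]
summary = five_number_summary(values)
print(summary)
```

The box spans Q1 to Q3 with a line at the median, and the whiskers extend toward the minimum and maximum.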

Density Curves

  • If there is enough data, the pattern can be displayed as a smooth curve.
    • Note: We are now assuming there is enough sample data to estimate true population values.
      *Why a density curve? It shows the population distribution clearly, rather than a sample distribution that only approximates the population.
  • The curve is always on or above the horizontal axis.
  • The area under the curve = 1.0.
  • Can take on any shape.
  • The vertical axis is typically not labeled, but it can show probability or frequency.

Normal Curve

  • Bell-shaped, symmetric, and unimodal.
  • Shape described by the population mean ($\mu$) and the population standard deviation ($\sigma$).

Normal Probability Distribution Function

Important: The normal probability distribution function is defined by two parameters: $\mu$ (mu), the population mean, and $\sigma$ (sigma), the population standard deviation.

  • Formula: $f(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{(x - \mu)^2}{2 \sigma^2}}$
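The formula translates almost term by term into code. The sketch below also checks numerically that the area under the curve is close to 1, as required of any density curve. The function names, the example parameters, and the choice of a midpoint sum over $\mu \pm 8\sigma$ are illustrative assumptions.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the normal curve with mean mu and standard deviation sigma at x."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

def area_under_curve(mu, sigma, steps=10_000):
    """Approximate the total area with a midpoint sum over mu +/- 8 sigma."""
    low, high = mu - 8 * sigma, mu + 8 * sigma
    width = (high - low) / steps
    heights = (normal_pdf(low + (i + 0.5) * width, mu, sigma) for i in range(steps))
    return sum(heights) * width

# Hypothetical parameters: mean 100, standard deviation 15.
mu, sigma = 100.0, 15.0
print(normal_pdf(mu, mu, sigma))                                 # peak height, at x = mu
print(normal_pdf(mu - 10, mu, sigma), normal_pdf(mu + 10, mu, sigma))  # symmetric about mu
print(area_under_curve(mu, sigma))                               # close to 1.0
```

Changing `mu` slides the curve left or right, and changing `sigma` makes it narrower and taller or wider and flatter. The total area stays at 1 in either case.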

Statistical Concerns

  • Learn to be discerning about data presentation and interpretation.
    • Outliers hidden in means
    • Confusing graphs
    • Correlation is not causation
    • Hidden info and who did the study?
Outliers Hidden in Means
  • Means are affected by outliers, which can distort the representation of typical values.
  • Median value should be reported anytime outliers are suspected.
Confusing Graphs
  • Graphs can be deceiving.
    • Not showing full scale (truncated graphs)
    • Using pictures or figures instead of bars
    • Using 3-D bar graphs
    • Misinterpretation
  • 3-D bar graphs are hard to read and should be used cautiously to avoid confusion.
Correlation is Not Causation
  • Correlation occurs when two variables seem to change together.
    *However, unless the relationship has been tested experimentally, you cannot conclude that variable 1 causes variable 2 to change.
Hidden Info and Who Did the Study
  • Important to ask whether an important piece of information is being left out.
  • Important to ask who did the study and whether they have an agenda.

Federal Funding and Data Availability

  • Data from federally funded research should be available for download, reading, and analysis free of charge no later than 12 months after initial publication.

Module 2 Summary

  • Understand how the sample mean, median, mode, range, sample standard deviation, and sample variance are calculated.
  • Focus on the purpose of each measure and on when to use one rather than another.
  • Focus on showing results of sample data using circle and bar graphs, stem and leaf diagrams, frequency histograms, and box and whisker plots.