Module 2: Descriptive Analysis and Presentation of Single-Variable Data
Overview of Descriptive Statistics
- Descriptive statistics provide an overview of data through:
- Summary Graphs
- Measures of Central Tendency
- Measures of Dispersion
- Measures of Position
- Statistical Concerns
- Initially, the focus is on analyzing a single variable.
Summary Graphs
- The type of graph used depends on the variable type:
- Quantitative (Numerical Value): Stem and leaf diagrams, frequency histograms
- Qualitative (Attribute): Circle graphs, bar graphs
Qualitative Data (Nominal and Ordinal)
- Often uses circle graphs or bar graphs to show relative proportions in various categories.
- Frequency: The number of observations in each category.
- Relative Frequency: The proportion or percentage of observations in each category (the frequency divided by the total number of observations).
- Graphs can show either frequency or relative frequency; the visual representation remains the same, but the data presented differs.
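As a minimal sketch of the difference between frequency and relative frequency, the snippet below tallies a small made-up sample. The `colors` data and the helper name `frequency_table` are assumptions introduced only for illustration.

```python
from collections import Counter

# Hypothetical sample of a qualitative variable (made-up data for illustration).
colors = ["red", "blue", "red", "green", "blue", "red", "red", "blue"]

def frequency_table(observations):
    """Return {category: (frequency, relative frequency in percent)}."""
    counts = Counter(observations)
    n = len(observations)
    return {cat: (f, round(100 * f / n, 1)) for cat, f in counts.items()}

table = frequency_table(colors)
for category, (f, rel) in table.items():
    print(f"{category:<6} f = {f}   relative f = {rel}%")
```

A bar graph of either column would have bars in the same proportions; only the scale on the axis changes.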
Circle Graphs (Qualitative Data)
- Should include an informative title and a legend.
Bar Graphs (Qualitative Data)
- Similar to circle graphs but use bars to show relative proportions.
- Can represent frequency or relative frequency.
- Should have an informative title, axis labels, and spaces between bars to indicate qualitative data.
Circle Graph vs. Bar Graph
- Both are used to display sample results.
- The choice depends on which graph shows the information more clearly.
Quantitative Data
- Quantitative data can be discrete or continuous.
Stem and Leaf Diagram
- Used for quantitative data.
- Includes a diagram title identifying the variable and a key explaining stem and leaf components.
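One way to build a stem and leaf diagram is sketched below, assuming two-digit values where the tens digit is the stem and the ones digit is the leaf. The `scores` data and the function name `stem_and_leaf` are hypothetical.

```python
from collections import defaultdict

# Hypothetical two-digit sample values (made up for illustration).
scores = [52, 58, 61, 64, 64, 67, 70, 73, 75, 75, 79, 82, 88, 91]

def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted leaves (ones digits)."""
    rows = defaultdict(list)
    for v in sorted(values):
        rows[v // 10].append(v % 10)
    return dict(rows)

diagram = stem_and_leaf(scores)
print("Exam scores (hypothetical)")   # diagram title identifying the variable
print("Key: 5 | 2 means 52")          # key explaining stem and leaf
for stem in range(min(diagram), max(diagram) + 1):
    leaves = " ".join(str(leaf) for leaf in diagram.get(stem, []))
    print(f"{stem} | {leaves}")
```

Looping over every stem between the smallest and largest keeps empty stems visible, so gaps in the data are not hidden.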
Frequency Distributions and Frequency Histograms
- Examine values ($x$) and their frequencies ($f$).
- Ungrouped Data: Frequencies are shown directly when there are few categories.
- Grouped Data: Data is summarized into classes for larger datasets.
- Classes should be equally spaced and non-overlapping.
- A good approach for determining the number of classes is: $\text{\# classes} = \sqrt{n}$, where $n$ is the number of observations.
Grouped Frequencies
- Frequency ($f$) is the number of observations in each class.
- $\sum f = n$: the frequencies sum to the total number of observations.
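The grouping steps above can be sketched as follows. This is only one possible approach: it assumes the square-root rule for the number of classes, equal-width classes of the form [lower, upper), and an integer class width. The `data` values and the function name `grouped_frequencies` are made up for illustration.

```python
import math

# Hypothetical quantitative sample (made up for illustration), n = 16.
data = [3, 7, 8, 12, 13, 14, 18, 19, 21, 22, 25, 27, 30, 31, 36, 39]

def grouped_frequencies(values, num_classes=None):
    """Return a list of (lower, upper, f) for equal-width classes [lower, upper)."""
    n = len(values)
    if num_classes is None:
        num_classes = round(math.sqrt(n))  # rule of thumb: # classes = sqrt(n)
    low, high = min(values), max(values)
    # Integer width chosen so that the last class still contains the maximum.
    width = math.floor((high - low) / num_classes) + 1
    classes = []
    for i in range(num_classes):
        lower = low + i * width
        upper = lower + width
        f = sum(1 for v in values if lower <= v < upper)
        classes.append((lower, upper, f))
    return classes

classes = grouped_frequencies(data)
for lower, upper, f in classes:
    print(f"{lower:>2} to under {upper:<2}  f = {f}")
print("sum of f =", sum(f for _, _, f in classes))
```

Because the classes are non-overlapping and cover the whole range, every observation is counted exactly once and the frequencies add up to the sample size.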
Frequency Histogram
- Should include an informative title, axes labels, and bars without gaps to indicate quantitative data.
Relative Frequency Histogram
- Shows the same information as a frequency histogram but uses relative frequencies.
- When calculating relative frequencies, round to two decimal places.
Histogram Shapes
- Symmetric: One side of the graph is a mirror image of the other.
- Uniform: Every value occurs with the same frequency.
- Skewed: One tail is stretched out longer than the other.
- Skewed Right: Tail is longer on the right side.
- Skewed Left: Tail is longer on the left side.
- J-shaped: No tail on the side with the highest frequency.
- Mode(s): One or more peaks in the data.
Note: When describing shape, a histogram can have more than one mode even if the peaks have different heights.
- Normal: Symmetric and mounded around the mean, sparse at the extremes (bell-shaped).
- All normal curves are symmetric, but not all symmetric curves are normal.
- Normal curves are unimodal, symmetric, and bell-shaped.
Outliers
- Values that fall a significant distance away from the rest of the data points.
- Not always present but should represent something unusual.
Key Concepts
- Graph type depends on data type:
- Qualitative: Circle graphs, bar graphs
- Quantitative: Stem and leaf diagrams, frequency histograms
- The main goal is to use the graph to describe sample data.
Measures of Central Tendency
- Provide information about where the middle of your sample data occurs.
- Sample Mean
- Sample Median
- Sample Mode
Note: These measures apply to quantitative data.
Sample Mean
- The arithmetic mean.
- Formula: $\bar{x} = \dfrac{\sum x}{n}$, where $n$ is the number of observations.
Sample Median
- The middle value when data values are ranked.
- Odd number of observations: the middle value.
- Even number of observations: the average of the two middle values.
Sample Mode
- The most frequent observation.
- Can have more than one mode if multiple values have the same highest frequency.
*Multiple statistical modes occur only when the tied values share the same highest frequency.
Measures of Center
- If the data are symmetrically and unimodally distributed, then the sample mean = median = mode.
- If the data are NOT symmetric:
- The sample mean is impacted the most, because it is pulled toward a few large or small values.
- Skewed Right: Mode < Median < Sample Mean
Data set 1: 1, 2, 2, 3, 4
Mean = 2.4, Median = 2, Mode = 2
Data set 1 with an outlier:
1, 2, 2, 3, 4, 20
Mean = 5.3, Median = 2.5, Mode = 2
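The figures for the two data sets above can be checked with Python's standard `statistics` module. This is a small sketch; the helper name `center` is introduced here only for illustration.

```python
import statistics

data_set_1 = [1, 2, 2, 3, 4]
with_outlier = data_set_1 + [20]

def center(values):
    """Return (mean, median, list of modes) for a list of numbers."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.multimode(values),  # every value tied for the highest frequency
    )

mean_1, median_1, modes_1 = center(data_set_1)
mean_2, median_2, modes_2 = center(with_outlier)
print(f"Data set 1:   mean = {mean_1:.1f}, median = {median_1}, mode(s) = {modes_1}")
print(f"With outlier: mean = {mean_2:.1f}, median = {median_2}, mode(s) = {modes_2}")
```

Adding the single value 20 more than doubles the mean, moves the median by only half a unit, and leaves the mode unchanged.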
Reasons and Impact of Using Different Measures (Mean, Median, Mode)
- Outliers: When outliers are suspected in the data, sample medians should be reported because the sample mean is impacted the most by some large or small values.
- Data utilization: The sample mean uses every value, while the sample median uses only the middle value(s) and the sample mode uses only the most frequent value(s).
- Gas Price Example (Wall Street Journal Article):
- The article discusses the most common gas price (3.79).
- The mode is the actual price on display at more gas stations than any other price.
- The average is skewed by ultra-high prices in California due to refinery shutdowns and higher taxes.
- The median and mode are unaffected by California's unusually high prices, making them more relevant.
- The average excluding California was close to GasBuddy's estimate of the mode.
Other examples of central tendency
- Debate over Smallest Fish:
- Paedocypris: adults can be as small as 7.9 mm long.
- Male Deep Sea Anglerfish: just 6.2 mm long.
- Stout Infantfish: 8.4 mm long and 1.5 mg, the lightest adult vertebrate.
- Just an Average Guy (Men’s Health magazine):
- Age: 34.4 years
- Weight: 175 lbs
- Height: 5’10”
- Drinks 3.3 cups of coffee and 1.2 alcoholic drinks a day.
Measures of Dispersion
- Assess the spread of data values around the center.
- Sample Range
- Sample Variance
- Sample Standard Deviation
Sample Range
- Range = maximum sample value - minimum sample value
Sample Variance
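- The average squared deviation of the data values from the sample mean.
- Formula: $s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$, where $\bar{x}$ is the sample mean and $n$ is the number of observations.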
Sample Standard Deviation
- The square root of the sample variance.
- Formula: $s = \sqrt{s^2}$, where $s^2$ is the sample variance.
- Measures average variation in the data set.
- Is always positive (unless all values are the same, then $s = 0$).
- The sample standard deviation is usually reported.
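As a worked sketch of the three measures of dispersion, the snippet below uses the data set 1, 2, 2, 3, 4 from the Measures of Center example. It assumes the usual sample convention of dividing the sum of squared deviations by n - 1; the function name `sample_variance` is introduced only for illustration, and the result is compared against the standard `statistics` module.

```python
import math
import statistics

data = [1, 2, 2, 3, 4]  # Data set 1 from the Measures of Center example

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

sample_range = max(data) - min(data)   # maximum minus minimum
s_squared = sample_variance(data)      # sample variance
s = math.sqrt(s_squared)               # sample standard deviation

print(f"range = {sample_range}, variance = {s_squared:.2f}, std dev = {s:.2f}")
print("matches statistics module:",
      math.isclose(s_squared, statistics.variance(data)),
      math.isclose(s, statistics.stdev(data)))
```

For this data the squared deviations from the mean of 2.4 sum to 5.2, so the variance is 5.2 / 4 = 1.3 and the standard deviation is about 1.14.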
Data Summary
- The sample mean estimates the center of the data.
- The sample standard deviation estimates the spread of the data.
Standard Deviation and Sample Spread Relation
*Why is s larger in the second graph? Both graphs have the same mean, but the values in the second graph are spread over a wider range.
- Batting Average Example:
- Batting average is the ratio of hits to at-bats.
- The standard deviation of batting averages has decreased over time while the mean batting average has remained steady. This is offered as the explanation for why there are no more 0.400 hitters.
Measures of Position
- Summarize sample data using measures of center and spread.
- Box and Whisker Plots
Quartiles or Percentiles
- Data ordered from smallest to largest value.
- Quartiles divide observations into 25% intervals.
Box and Whisker Plot
- Shows center and spread of data.
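A box and whisker plot is typically drawn from five values: the minimum, the first quartile, the median, the third quartile, and the maximum. The sketch below computes them under one common convention (the median of each half, excluding the middle value when n is odd). Textbooks and software use several slightly different quartile rules, so this choice is an assumption, as are the `sample` data and the function name `five_number_summary`.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    ordered = sorted(values)           # data ordered from smallest to largest
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]             # values below the median position
    upper = ordered[half + n % 2:]     # values above the median position
    return (
        ordered[0],
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
        ordered[-1],
    )

# Hypothetical sample (made up for illustration).
sample = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 18]
summary = five_number_summary(sample)
print("min, Q1, median, Q3, max =", summary)
```

The box spans Q1 to Q3 with a line at the median, and the whiskers extend toward the minimum and maximum.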
Density Curves
- With enough data, the pattern of the distribution can be displayed as a smooth curve.
- Note: We are now assuming enough sample data to estimate true population values.
*Why a density curve? It shows the population distribution clearly, rather than a sample distribution that only approximates the population.
- Line always on or above the horizontal axis.
- The area under the curve = 1.0.
- Can take on any shape.
- Typically don’t label vertical axis but can show probability or frequency.
Normal Curve
- Bell-shaped, symmetric, and unimodal.
- Shape described by the population mean ($\mu$) and the population standard deviation ($\sigma$).
Normal Probability Distribution Function
IMPORTANT TO NOTE: The normal probability distribution function is defined by two parameters: $\mu$ (mu), the population mean, and $\sigma$ (sigma), the population standard deviation.
- Formula: $f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
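As a rough sketch, the standard normal density formula, exp(-(x - mu)^2 / (2 sigma^2)) / (sigma sqrt(2 pi)), can be coded directly and checked against `statistics.NormalDist` from the Python standard library. The parameter values `mu = 100` and `sigma = 15` and the helper names are assumptions chosen only for illustration. The trapezoid sum is a numerical approximation of the area under the curve, not an exact integral.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Height of the normal curve with mean mu and standard deviation sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def area_under_curve(mu, sigma, steps=10_000):
    """Trapezoid-rule approximation of the area from mu - 8*sigma to mu + 8*sigma."""
    a, b = mu - 8 * sigma, mu + 8 * sigma
    h = (b - a) / steps
    total = 0.5 * (normal_pdf(a, mu, sigma) + normal_pdf(b, mu, sigma))
    total += sum(normal_pdf(a + i * h, mu, sigma) for i in range(1, steps))
    return total * h

mu, sigma = 100.0, 15.0  # hypothetical parameter values
area = area_under_curve(mu, sigma)
print("height at the mean:", normal_pdf(mu, mu, sigma))
print("library value:     ", NormalDist(mu, sigma).pdf(mu))
print("approximate area:  ", area)
```

The curve is highest at the mean, has equal heights at equal distances on either side of it, and the approximate area comes out very close to 1.0, in line with the density curve properties listed earlier.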
Statistical Concerns
- Learn to be discerning about data presentation and interpretation.
- Outliers hidden in means
- Confusing graphs
- Correlation is not causation
- Hidden info and who did the study?
Outliers Hidden in Means
- Means are affected by outliers, which can distort the representation of typical values.
- Median value should be reported anytime outliers are suspected.
Confusing Graphs
- Graphs can be deceiving.
- Not showing full scale (truncated graphs)
- Using pictures or figures instead of bars
- Using 3-D bar graphs
- Misinterpretation
*Truncated graphs: an axis that does not start at zero or show the full scale can make small differences look large.
- 3-D bar graphs are hard to read and should be used cautiously to avoid confusion.
Correlation is Not Causation
- Correlation occurs when two variables seem to change together.
*However, unless the relationship has been tested experimentally, you cannot conclude that variable 1 causes variable 2 to change.
Hidden Info and Who Did the Study
- Important to ask whether an important piece of information is being left out.
- Important to ask who did the study and whether they have an agenda.
Federal Funding and Data Availability
- Data should be available for download, reading, and analysis free of charge no later than 12 months after initial publication.
Module 2 Summary
- Understand how the sample mean, median, mode, range, sample standard deviation, and sample variance are calculated.
- Focus on the purpose of these different measures and on why you would use one over another.
- Focus on showing results of sample data using circle and bar graphs, stem and leaf diagrams, frequency histograms, and box and whisker plots.