Notes on Descriptive Statistics (Qualitative and Quantitative)
2.1 Graphical and Numerical Methods for Describing Qualitative Data
Purpose: Describe qualitative observations by categorizing observations into mutually exclusive categories (or classes).
Data description: Use counts or proportions (relative frequencies) per category.
Example context: Safety of nuclear power reactors; hazards of energy use.
Qualitative variable: cause of fatal energy-related accidents.
Data: 62 accidents fall into six categories (causes).
Graphical descriptions for qualitative data:
Bar graphs: height/length proportional to category frequency (or relative frequency).
Pie charts: circle divided into slices; central angle of each slice proportional to the category relative frequency.
Pareto diagram:
A frequency bar graph with bars arranged in descending order of height (leftmost the tallest).
Named after Vilfredo Pareto (Italian economist).
Use: popular in process and quality control to highlight most frequent problems (e.g., defects, accidents, breakdowns).
2.1.1 Applied Exercises
Exercise 1.1: Social robots
Data: random sample of 106 social robots; 63 with legs only, 20 with wheels only, 8 with both legs and wheels, 15 with neither.
Tasks:
a. Identify the type of graph used to describe the data.
b. Identify the variable measured for each of the 106 designs.
c. Use the graph to identify the most-used social robot design.
d. Compute class relative frequencies for the categories.
e. Use the results in (d) to construct a Pareto diagram.
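Tasks (d) and (e) can be sketched in a few lines of Python. This is only an illustrative sketch: the counts come from the exercise statement, while the variable names and the text-based bars are assumptions made here, not part of the exercise.

```python
# Counts taken from the exercise statement (106 social robots).
counts = {
    "legs only": 63,
    "wheels only": 20,
    "both legs and wheels": 8,
    "neither": 15,
}

n = sum(counts.values())  # total number of robots in the sample

# Class relative frequency = class frequency / n.
rel_freq = {design: count / n for design, count in counts.items()}

# Pareto order: categories sorted from most to least frequent.
pareto_order = sorted(rel_freq, key=rel_freq.get, reverse=True)

for design in pareto_order:
    bar = "#" * round(50 * rel_freq[design])  # crude text bar, longest first
    print(f"{design:<22}{rel_freq[design]:6.3f}  {bar}")
```

Sorting by relative frequency rather than by raw count gives the same order, since every count is divided by the same n.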
Exercise 2.1: STEM experiences for girls
NSF study: 174 young women in informal STEM programs; geographic location described via a pie chart (urban, suburban, rural).
Data: 107 urban, 57 suburban, 10 rural.
Tasks: construct the pie chart and interpret the results.
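Because each slice's central angle is proportional to the category relative frequency, the angles can be computed before any drawing is done. A minimal sketch, using the counts stated above (the variable names are illustrative):

```python
# Counts from the exercise statement (174 participants).
counts = {"urban": 107, "suburban": 57, "rural": 10}
n = sum(counts.values())

# Central angle of a slice = 360 degrees * relative frequency.
angles = {location: 360 * count / n for location, count in counts.items()}

for location, angle in angles.items():
    print(f"{location:<9} relative frequency = {counts[location] / n:.3f}, "
          f"central angle = {angle:.1f} degrees")
```

The angles sum to 360 degrees, which is a quick check that no category was missed.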
Exercise: Beach erosional hotspots
Data: measurements recorded for six beach hotspots.
Tasks:
a. Identify each recorded variable as quantitative or qualitative.
b. Form a pie chart for the beach condition of the six hotspots.
c. Form a pie chart for the nearshore bar condition of the six hotspots.
d. Comment on the reliability of using pie charts to infer about all beach hotspots in the country.
Exercise 3.1: Railway track allocation
Data: 53 trains allocated to 11 tracks; table provided.
Task: Construct a Pareto diagram for the data and determine if track allocation is evenly distributed; identify underutilized and overutilized tracks.
2.2 Graphical Methods for Describing Quantitative Data
Descriptive goals: Describe, summarize, and detect patterns in quantitative data using three graphical methods:
Dot plots
Stem-and-leaf displays
Histograms
Software-friendly focus: Interpretation of these displays rather than their construction.
Example context: EPA mileage ratings (100 measurements) for a car model (Table 2.2 referenced).
Dot plot:
Horizontal axis: scale for the quantitative variable (e.g., miles per gallon, mpg).
Each measurement is represented by a dot at the rounded value (to the nearest 0.5 mpg).
Repeated values stack into piles above the same location.
Insight: almost all mpg values lie in the 30s; most between 35 and 40 mpg.
Stem-and-leaf display (MINITAB example, Figure 2.6):
Stem: left of the decimal point; Leaf: right of the decimal point.
Rows ordered from smallest stem (e.g., 30) to largest (e.g., 44).
Interpretation: shows distribution between 30.0 and 44.9 mpg; counts per stem show frequency; e.g., six leaves in stem row 34 indicate six observations in [34.0, 35.0).
Advantage: retains original measurements; easy to locate exact values (e.g., two observations at 36.3).
Limitation: can be unwieldy for very large data sets due to many stems/leaves.
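The stem/leaf split described above can be sketched as follows. The mpg values here are hypothetical (they are not the Table 2.2 data), and the function name stem_and_leaf is an assumption made for illustration.

```python
from collections import defaultdict

# Hypothetical mpg readings recorded to one decimal place (not Table 2.2).
mpg = [36.3, 34.1, 36.3, 35.0, 37.9, 34.8, 36.0, 35.5, 37.2, 34.4]

def stem_and_leaf(values):
    """Map each stem (integer part) to its sorted leaves (first decimal digit)."""
    rows = defaultdict(list)
    for v in sorted(values):
        tenths = round(v * 10)            # 36.3 -> 363; avoids float truncation
        stem, leaf = divmod(tenths, 10)   # 363 -> stem 36, leaf 3
        rows[stem].append(leaf)
    return dict(rows)

display = stem_and_leaf(mpg)
for stem in sorted(display):
    print(f"{stem} | {' '.join(str(leaf) for leaf in display[stem])}")
```

Every original value can be read back from the display (stem 36 with leaf 3 is 36.3), which is the advantage noted above.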
Histogram (SPSS example, Figure 2.7):
Horizontal axis: mpg intervals (e.g., 30–31, 31–32, …, 44–45).
Vertical axis: frequency (count) of observations in each interval.
Insight: e.g., about 21 of 100 cars (21%) fall in [37, 38) mpg.
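The class counts behind a histogram with one-unit intervals can be sketched as below. The data are hypothetical (not the Table 2.2 mileage ratings), and labelling each class by its lower endpoint is a choice made for this sketch.

```python
import math
from collections import Counter

# Hypothetical mpg readings (not the Table 2.2 data).
mpg = [36.3, 34.1, 36.3, 35.0, 37.9, 34.8, 36.0, 35.5, 37.2, 34.4]

# Class [k, k + 1) is labelled by its lower endpoint k.
frequency = Counter(math.floor(v) for v in mpg)
n = len(mpg)

for lower in sorted(frequency):
    count = frequency[lower]
    print(f"[{lower}, {lower + 1}): frequency = {count}, "
          f"relative frequency = {count / n:.2f}")
```

Unlike the stem-and-leaf display, only the counts per class survive; the individual measurements are no longer visible.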
General notes:
Histograms give a good overall visual of the distribution for large data sets but do not display individual measurements.
Dot plots and stem-and-leaf displays show individual observations; stem-and-leaf sorts data in ascending order and highlights exact values, but can be large for big data sets.
2.2.1 Applied Exercises
Exercise 2.1: Sound waves from a basketball (American Journal of Physics, 2010)
Data: frequencies of the first 24 resonances (Hz) listed in a table.
Task: use a graphical method to describe the distribution of these frequencies.
Exercise 2.1: Surface roughness of pipe (Oil field pipes)
Data: 20 sample surface roughness measurements (in micrometers).
Task: describe the sample data with an appropriate graphical method.
2.3 Numerical Methods for Describing Quantitative Data
Numerical descriptive measures: numbers computed from a data set to describe its relative frequency distribution.
Three categories:
Measures of central tendency (location/center)
Measures of variation (spread)
Measures of relative standing (position of an observation within the data set)
Notation:
y denotes the observed variable; observations are y1, y2, …, yn.
Statistics: numerical descriptive measures computed from sample data.
Parameters: corresponding population measures; often denoted by Greek letters (e.g., μ for population mean).
Key concepts:
Sample mean: \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i
Population mean: μ
Median, mode, and the idea of central tendency as core descriptive statistics.
2.4 Measures of Central Tendency
Three common measures:
Arithmetic mean: balance point of the relative frequency distribution.
Median: middle value; splits data into two equal-area halves.
Mode: value with the greatest frequency (peak of distribution).
Visual intuition (Figure 2.9 in text):
Mean is the balance point of the relative frequency distribution.
Median divides the distribution such that half the area is to the left and half to the right.
Mode corresponds to the peak (highest relative frequency).
Practical considerations:
Mean is sensitive to extreme values (outliers) and skewness; can be misleading in skewed data.
Median is a resistant measure of central tendency; not affected as much by extreme values; preferred for highly skewed data (e.g., starting salaries).
Mode is rarely the best measure unless the mode itself is of specific interest (e.g., modal nail length for a supplier).
Summary: the best measure depends on the type of descriptive information desired.
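The sensitivity of the mean, and the resistance of the median, can be illustrated with a small sketch. The salary figures are hypothetical and chosen only so that one value is far above the rest.

```python
import statistics

# Hypothetical starting salaries (thousands of dollars); the last is extreme.
salaries = [52, 55, 55, 58, 60, 61, 250]

mean = statistics.mean(salaries)
median = statistics.median(salaries)
mode = statistics.mode(salaries)

# The same data without the extreme value.
trimmed = salaries[:-1]
mean_without = statistics.mean(trimmed)
median_without = statistics.median(trimmed)

print(f"with extreme value:    mean = {mean:.1f}, median = {median}, mode = {mode}")
print(f"without extreme value: mean = {mean_without:.1f}, median = {median_without}")
```

Dropping the single extreme value moves the mean by more than 25 (thousand dollars) but the median by only 1.5, which is why the median is preferred for highly skewed data.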
2.4.1 Applied Exercises
Exercise: Highest paid engineers (Electronic Design, 2012)
Given population means: software engineering manager mean = 126,417; manufacturing/production engineer mean = 92,360; assume mound-shaped distributions.
Statements (true/false):
a. All software engineering managers earn exactly 126,417. (False)
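b. Half of manufacturing/production engineers earn less than 92,360. (True if the distribution is symmetric as well as mound-shaped, because the mean then equals the median; otherwise only approximately true.)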
c. A randomly selected software engineering manager will always earn more than a randomly selected manufacturing/production engineer. (False)
Exercise: Ammonia in car exhaust (Environmental Science & Technology, 2000)
Data: eight daily afternoon ammonia concentrations (ppm).
Tasks: compute the mean; compute the median; interpret the values.
2.3–2.4 Further Note on Central Tendency and Variability
Additional context from subsequent materials (e.g., 3.1 Crude oil biodegradation exercise) introduces related calculations for means, medians, and modes on specific data sets; these extend the application of central tendency concepts to real data.
2.5 Measures of Variation
Rationale: Central tendency alone omits information about dispersion (spread).
Common measures of variation:
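Range: difference between the maximum and minimum values; simple to compute, but it depends only on the two extreme values and so does not reflect how the rest of the data are distributed.
Variance: average squared deviation from the mean (the sample variance s^2 divides the sum of squared deviations by n - 1 rather than n); its units are the square of the original units (units^2).
Standard deviation: square root of the variance; same units as the original data; easiest to interpret when used with the mean.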
Relationship with mean:
When combined with the mean, standard deviation provides a practical sense of spread around the center.
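The three measures can be computed directly from their definitions. The sketch below uses a hypothetical sample of five measurements (not the data from any exercise in these notes) and the n - 1 divisor for the sample variance.

```python
import math

# Hypothetical sample of n = 5 measurements (units: ppm).
y = [1.5, 1.4, 1.6, 1.9, 1.6]

n = len(y)
y_bar = sum(y) / n                                  # sample mean

data_range = max(y) - min(y)                        # range, in ppm
s2 = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)   # sample variance, in ppm^2
s = math.sqrt(s2)                                   # standard deviation, in ppm

print(f"range = {data_range:.2f} ppm")
print(f"variance = {s2:.4f} ppm^2")
print(f"standard deviation = {s:.4f} ppm")
```

Only the standard deviation and the range are in the original units, so they are the values that can be compared directly with the mean.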
Rules for interpretation:
Empirical Rule (for mound-shaped distributions):
Approximately 68% of observations lie within \bar{y} \pm s.
Approximately 95% lie within \bar{y} \pm 2s.
Approximately 99.7% lie within \bar{y} \pm 3s.
Chebyshev’s Rule (for any distribution):
At least a proportion 1 - \frac{1}{k^2} of the observations lie within k standard deviations of the mean, for any k > 1.
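A short sketch makes the contrast between the two rules concrete. The function name chebyshev_lower_bound is an assumption for illustration; the Empirical Rule percentages are the ones quoted above.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's Rule is informative only for k > 1")
    return 1 - 1 / k**2

# Empirical Rule values as quoted in the notes (mound-shaped data only).
empirical = {1: 0.68, 2: 0.95, 3: 0.997}

for k in (2, 3):
    print(f"k = {k}: Chebyshev at least {chebyshev_lower_bound(k):.3f}, "
          f"Empirical Rule about {empirical[k]}")
```

For k = 2 Chebyshev guarantees at least 0.75, well below the Empirical Rule's 0.95; that gap is the price of a bound that holds for any distribution.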
2.5.1 Applied Exercises
Exercise: Highest paid engineers (mean = 126,417; sd = 15,000)
Distributions are mound-shaped and symmetric with given sd.
Task: sketch the distribution showing intervals \bar{y} \pm \sigma, \bar{y} \pm 2\sigma, \bar{y} \pm 3\sigma and estimate the proportion in each interval.
Exercise: Ammonia in car exhaust
Data: eight daily afternoon ammonia concentrations (ppm), the same data as in the ammonia exercise of Section 2.4.1.
Tasks:
a. Find the range.
b. Find the variance.
c. Find the standard deviation.
d. If the std deviation during morning drive-time is 1.45 ppm, which time (morning or afternoon) is more variable?
Exercise: Bearing strength of FRP strips
Data: 10 FRP strip bearing strength measurements (MPa).
Task: use the sample data to provide an interval likely to contain the bearing strength.
2.6 Measures of Relative Standing
Purpose: Describe the location of an observation relative to the rest of the data.
Percentiles: specify the relative standing; e.g., the 84th percentile means 84% of observations are below that value.
Key percentiles:
25th percentile = lower quartile (Q1)
50th percentile = median (Q2)
75th percentile = upper quartile (Q3)
Z-scores:
Definition: a z-score describes the location of an observation y relative to the mean in units of the standard deviation.
Population form: z =\frac{y - \mu}{\sigma}
Sample form: z =\frac{y - \bar{y}}{s}
Interpretive guidance:
Negative z-scores indicate observations left of the mean; positive z-scores indicate right of the mean.
Empirical Rule implications: most observations have |z| < 2, and almost all have |z| < 3.
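The sample form of the z-score can be sketched as follows, using a small hypothetical data set (the values are illustrative only).

```python
import statistics

# Hypothetical sample.
y = [10, 12, 14, 16, 18]

y_bar = statistics.mean(y)   # sample mean
s = statistics.stdev(y)      # sample standard deviation (n - 1 divisor)

# Sample z-score: distance from the mean in units of s.
z_scores = [(yi - y_bar) / s for yi in y]

for yi, z in zip(y, z_scores):
    side = "left of" if z < 0 else "right of" if z > 0 else "at"
    print(f"y = {yi}: z = {z:+.3f} ({side} the mean)")
```

The z-scores of a sample always sum to zero, since the deviations from the sample mean do.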
2.6.1 Applied Exercises
Exercise: Phosphorus standards in the Everglades
Given: 75th percentile of TP distribution is 10 μg/L.
Task: interpret this percentile value and justify why it was used as a standard.
Exercise: Lead in drinking water (EPA Action Level = 0.015 mg/L)
Statement: 90th percentile of a study sample is 0.00372 mg/L.
Question: Are customers at risk of unhealthy lead levels? Explain.
2.7 Methods for Detecting Outliers
Z-score method: A value with a large |z| may be an outlier.
Box plot method (based on IQR):
Define hinges: QL (lower quartile) and QU (upper quartile).
Median shown inside the box.
Inner fences: located 1.5 × IQR below QL and 1.5 × IQR above QU; observations beyond the inner fences are flagged as potential outliers with a marker (e.g., * or 0).
Outer fences: located 3 × IQR below QL and 3 × IQR above QU; observations beyond the outer fences are the strongest outlier candidates.
IQR-based fences are largely unaffected by outliers because quartiles are resistant to extreme values, whereas an extreme value inflates the standard deviation and therefore shrinks its own z-score, which can make it look less unusual than it is.
Comparison:
Both methods provide rule-of-thumb criteria for labeling outliers.
Box-plot-based fences use quartiles and IQR, less sensitive to extreme values than standard deviation-based z-scores.
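The comparison can be made concrete with a sketch on hypothetical data containing one very large value. Quartile conventions differ between packages; this sketch assumes the default ("exclusive") method of Python's statistics.quantiles, so other software may report slightly different fences.

```python
import statistics

# Hypothetical sample with one suspiciously large value.
y = [21, 23, 24, 25, 25, 26, 27, 28, 29, 60]

# Quartiles under the default "exclusive" method (an assumption of this sketch).
q_l, median, q_u = statistics.quantiles(y, n=4)
iqr = q_u - q_l

inner = (q_l - 1.5 * iqr, q_u + 1.5 * iqr)
outer = (q_l - 3 * iqr, q_u + 3 * iqr)

beyond_inner = [v for v in y if v < inner[0] or v > inner[1]]
beyond_outer = [v for v in y if v < outer[0] or v > outer[1]]

# z-score of the largest value, using the mean and sd of the same sample.
y_bar = statistics.mean(y)
s = statistics.stdev(y)
z_max = (max(y) - y_bar) / s

print(f"inner fences: {inner}, beyond: {beyond_inner}")
print(f"outer fences: {outer}, beyond: {beyond_outer}")
print(f"z-score of {max(y)}: {z_max:.2f}")
```

In this sample the value 60 falls beyond even the outer fence, yet its z-score stays below 3 because the value itself inflates the standard deviation.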
2.7.1 Applied Exercises
Exercise: Highest paid engineers
Setup: Distribution is mound-shaped and symmetric with sd = 15,000; salary claim = 180,000.
Task: Assess believability of the claim and explain.
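One way to approach the task is to compute the z-score of the claimed salary. The sketch below assumes the two population means quoted earlier in these notes (126,417 and 92,360) and that sd = 15,000 applies to both groups; the variable names are illustrative.

```python
sd = 15_000
claimed_salary = 180_000

# Population means quoted earlier in the notes.
means = {
    "software engineering manager": 126_417,
    "manufacturing/production engineer": 92_360,
}

# z = (y - mu) / sigma for the claimed salary in each group.
z_scores = {job: (claimed_salary - mu) / sd for job, mu in means.items()}

for job, z in z_scores.items():
    print(f"{job}: z = {z:.2f}")
```

Both z-scores exceed 3, so under the Empirical Rule a salary of 180,000 would be very unusual in either group, more so for a manufacturing/production engineer.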
Exercise: Barium content of clinkers
Given: summary statistics for 200 clinkers: QL = 115, m = 170, QU = 260.
Tasks:
a. Interpret the median m.
b. Interpret QL.
c. Interpret QU.
d. Compute IQR.
e. Endpoints of the inner fence for a box plot.
f. If no values fall beyond the inner fences, what does this imply?
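Parts (d) and (e) follow directly from the fence definitions. A minimal sketch using the summary statistics given above (units are whatever the clinker data were recorded in; the variable names are illustrative):

```python
# Summary statistics from the exercise statement.
q_l, m, q_u = 115, 170, 260

iqr = q_u - q_l                                   # interquartile range
inner = (q_l - 1.5 * iqr, q_u + 1.5 * iqr)        # inner fences

print(f"IQR = {iqr}")
print(f"inner fences: {inner[0]} and {inner[1]}")
```

The lower inner fence comes out negative, which no barium content can reach, so in practice only the upper fence can flag observations here.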
Exercise: Zinc phosphide in sugarcane
Data: distribution of ZnP concentration ≈ N(mean = 2.0%, sd = 0.08%).
Question: If one batch contains 1.80%, does this indicate too little ZnP in production? Explain.
2.8 Supplementary Exercises
Exercise: Fate of scrapped tires
Data: categories and frequencies for tires’ fate; tasks include identifying variable, classes, class relative frequencies, pie chart, Pareto.
Exercise: Microbial fuel cells (MFCs)
Data: 54 articles categorized by investigation area; graph summarizes areas; tasks:
a. Identify the qualitative variable measured for each article.
b. Identify the type of graph.
c. Convert to a Pareto diagram and identify the dominant investigation area.
Exercise: Deep-hole drilling
Data: frequency histogram for drill chip lengths (50 chips).
Tasks:
a. Convert to a relative frequency histogram.
b. Based on the relative histogram, assess whether you would expect a drill chip of at least 190 mm.
4–6 Contextual and Practical Notes
Overall understanding: The materials emphasize choosing the right graphical method based on data type, the trade-offs between displaying individual observations vs. distribution summaries, and practical rules for identifying central tendency, variability, and outliers.
Connections to prior/statistical principles:
Concepts of consistency between graphical descriptions and numerical summaries.
Foundations for hypothesis testing and quality control (e.g., Pareto diagrams, outlier rules).
Real-world relevance:
Qualitative data handling in safety/hazards contexts, manufacturing quality control, and program evaluation.
Quantitative data analysis foundations applied to engineering, environmental science, and industrial operations.
Ethical/philosophical/practical implications:
The choice of descriptive method affects interpretation and decision-making (e.g., misrepresenting spread via inappropriate use of mean or histogram).
Outlier handling and data integrity considerations in reporting results.
Key Formulas and Concepts (Summary)
Relative frequency (proportion) for category i:
pi = \frac{ni}{n}Mean (sample):
\bar{y} = \frac{1}{n} \sum{i=1}^{n} yiZ-score (population):
z = \frac{y - \mu}{\sigma}Z-score (sample):
z = \frac{y - \bar{y}}{s}Interquartile range:
IQR = QU - QLInner fences (boxplot):
QL - 1.5 \cdot IQR, \, QU + 1.5 \cdot IQROuter fences (boxplot):
QL - 3 \cdot IQR, \, QU + 3 \cdot IQRPercentiles:
25th = Q1, 50th = Q2 = median, 75th = Q3
Empirical Rule (approximate for mound-shaped distributions):
Within \sigma: ~68%; within 2\sigma: ~95%; within 3\sigma: ~99.7%
Chebyshev’s Rule (general):
At least 1 - \frac{1}{k^2} of observations lie within k\sigma of the mean, for any distribution and k > 1
Note: Throughout, the observed variable is denoted y, and a sample consists of n observations (y1, y2, …, yn). Population parameters are typically denoted by Greek letters (e.g., μ, σ).