Notes on Descriptive Statistics (Qualitative and Quantitative)

2.1 Graphical and Numerical Methods for Describing Qualitative Data

  • Purpose: Describe qualitative observations by categorizing observations into mutually exclusive categories (or classes).

  • Data description: Use counts or proportions (relative frequencies) per category.

  • Example context: Safety of nuclear power reactors; hazards of energy use.

    • Qualitative variable: cause of fatal energy-related accidents.

    • Data: 62 accidents fall into six categories (causes).

  • Graphical descriptions for qualitative data:

    • Bar graphs: height/length proportional to category frequency (or relative frequency).

    • Pie charts: circle divided into slices; central angle of each slice proportional to the category relative frequency.

  • Pareto diagram:

    • A frequency bar graph with bars arranged in descending order of height (leftmost the tallest).

    • Named after Vilfredo Pareto (Italian economist).

    • Use: popular in process and quality control to highlight most frequent problems (e.g., defects, accidents, breakdowns).
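
The counting steps behind bar graphs, pie charts, and Pareto diagrams can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: it uses the social-robot counts quoted in the exercises of these notes, and the function names (relative_frequencies, pareto_order, pie_angles) are made up for this example.

```python
# Counts from the social-robot sample quoted in these notes (n = 106).
counts = {
    "legs only": 63,
    "wheels only": 20,
    "both legs and wheels": 8,
    "neither": 15,
}


def relative_frequencies(counts):
    """Return each category's share of the total count."""
    n = sum(counts.values())
    return {category: count / n for category, count in counts.items()}


def pareto_order(counts):
    """Return (category, count) pairs sorted from most to least frequent."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


def pie_angles(counts):
    """Return the central angle, in degrees, of each pie slice."""
    return {c: 360 * p for c, p in relative_frequencies(counts).items()}


rel_freq = relative_frequencies(counts)
for category, count in pareto_order(counts):
    # One row per bar of the Pareto diagram, tallest bar first.
    print(f"{category:22s} {count:3d}  {rel_freq[category]:.3f}")
```

Sorting by count is the only difference between an ordinary frequency bar graph and a Pareto diagram; the relative frequencies and pie angles are the same numbers on different scales.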

2.1.1 Applied Exercises

  • Exercise 1.1: Social robots

    • Data: random sample of 106 social robots; 63 with legs only, 20 with wheels only, 8 with both legs and wheels, 15 with neither.

    • Tasks:

    • a. Identify the type of graph used to describe the data.

    • b. Identify the variable measured for each of the 106 designs.

    • c. Use the graph to identify the most-used social robot design.

    • d. Compute class relative frequencies for the categories.

    • e. Use the results in (d) to construct a Pareto diagram.

  • Exercise 2.1: STEM experiences for girls

    • NSF study: 174 young women in informal STEM programs; geographic location described via a pie chart (urban, suburban, rural).

    • Data: 107 urban, 57 suburban, 10 rural.

    • Tasks: construct the pie chart and interpret the results.

  • Exercise: Beach erosional hotspots

    • Data: measurements recorded for six beach hotspots.

    • Tasks:

    • a. Identify each recorded variable as quantitative or qualitative.

    • b. Form a pie chart for the beach condition of the six hotspots.

    • c. Form a pie chart for the nearshore bar condition of the six hotspots.

    • d. Comment on the reliability of using these pie charts to make inferences about all beach hotspots in the country.

  • Exercise 3.1: Railway track allocation

    • Data: 53 trains allocated to 11 tracks; table provided.

    • Task: Construct a Pareto diagram for the data and determine if track allocation is evenly distributed; identify underutilized and overutilized tracks.

2.2 Graphical Methods for Describing Quantitative Data

  • Descriptive goals: Describe, summarize, and detect patterns in quantitative data using three graphical methods:

    • Dot plots

    • Stem-and-leaf displays

    • Histograms

  • Focus: since statistical software produces these displays, the emphasis is on interpreting them rather than constructing them by hand.

  • Example context: EPA mileage ratings (100 measurements) for a car model (Table 2.2 referenced).

  • Dot plot:

    • Horizontal axis: scale for the quantitative variable (e.g., miles per gallon, mpg).

    • Each measurement is represented by a dot at the rounded value (to the nearest 0.5 mpg).

    • Repeated values stack into piles above the same location.

    • Insight: almost all mpg values lie in the 30s; most between 35 and 40 mpg.

  • Stem-and-leaf display (MINITAB example, Figure 2.6):

    • Stem: the digits to the left of the decimal point; leaf: the digit to the right of the decimal point.

    • Rows ordered from smallest stem (e.g., 30) to largest (e.g., 44).

    • Interpretation: shows distribution between 30.0 and 44.9 mpg; counts per stem show frequency; e.g., six leaves in stem row 34 indicate six observations in [34.0, 35.0).

    • Advantage: retains original measurements; easy to locate exact values (e.g., two observations at 36.3).

    • Limitation: can be unwieldy for very large data sets due to many stems/leaves.

  • Histogram (SPSS example, Figure 2.7):

    • Horizontal axis: mpg intervals (e.g., 30–31, 31–32, …, 44–45).

    • Vertical axis: frequency (count) of observations in each interval.

    • Insight: e.g., about 21 of 100 cars (21%) fall in [37, 38) mpg.

  • General notes:

    • Histograms give a good overall visual of the distribution for large data sets but do not display individual measurements.

    • Dot plots and stem-and-leaf displays show individual observations; stem-and-leaf sorts data in ascending order and highlights exact values, but can be large for big data sets.
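
To make the difference between a stem-and-leaf display and a histogram concrete, here is a small sketch in Python. The mpg values are hypothetical (they are not the Table 2.2 data), and the helper names stem_and_leaf and histogram_counts are invented for this illustration. It assumes values recorded to one decimal place and unit-width intervals [k, k + 1).

```python
import math
from collections import Counter, defaultdict

# Hypothetical mpg readings recorded to one decimal place (not Table 2.2).
mpg = [36.3, 36.3, 34.0, 34.9, 37.1, 35.5, 30.0, 44.9, 37.8, 36.9]


def stem_and_leaf(values):
    """Map each stem (integer part) to its sorted leaves (tenths digit)."""
    display = defaultdict(list)
    for v in values:
        tenths = round(v * 10)  # integer tenths avoid floating-point surprises
        display[tenths // 10].append(tenths % 10)
    return {stem: sorted(leaves) for stem, leaves in sorted(display.items())}


def histogram_counts(values):
    """Count observations in unit-width intervals [k, k + 1)."""
    return dict(sorted(Counter(math.floor(v) for v in values).items()))


for stem, leaves in stem_and_leaf(mpg).items():
    # Each row keeps the exact measurements, e.g. two leaves of 3 in row 36.
    print(stem, "|", "".join(str(leaf) for leaf in leaves))

print(histogram_counts(mpg))  # only the count per interval survives
```

The stem-and-leaf rows still show that 36.3 occurs twice, while the histogram counts only report how many values fall in [36, 37).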

2.2.1 Applied Exercises

  • Exercise 2.1: Sound waves from a basketball (American Journal of Physics, 2010)

    • Data: frequencies of the first 24 resonances (Hz) listed in a table.

    • Task: use a graphical method to describe the distribution of these frequencies.

  • Exercise 2.1: Surface roughness of pipe (Oil field pipes)

    • Data: 20 sample surface roughness measurements (in micrometers).

    • Task: describe the sample data with an appropriate graphical method.

2.3 Numerical Methods for Describing Quantitative Data

  • Numerical descriptive measures: numbers computed from a data set to describe its relative frequency distribution.

  • Three categories:

    • Measures of central tendency (location/center)

    • Measures of variation (spread)

    • Measures of relative standing (position of an observation within the data set)

  • Notation:

    • y denotes the observed variable; observations are y1, y2, …, yn.

    • Statistics: numerical descriptive measures computed from sample data.

    • Parameters: corresponding population measures; often denoted by Greek letters (e.g., μ for population mean).

  • Key concepts:

    • Sample mean: \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i

    • Population mean: μ

    • Median, mode, and the idea of central tendency as core descriptive statistics.

2.4 Measures of Central Tendency

  • Three common measures:

    • Arithmetic mean: balance point of the relative frequency distribution.

    • Median: middle value; splits data into two equal-area halves.

    • Mode: value with the greatest frequency (peak of distribution).

  • Visual intuition (Figure 2.9 in text):

    • Mean is the balance point of the relative frequency distribution.

    • Median divides the distribution such that half the area is to the left and half to the right.

    • Mode corresponds to the peak (highest relative frequency).

  • Practical considerations:

    • Mean is sensitive to extreme values (outliers) and skewness; can be misleading in skewed data.

    • Median is a resistant measure of central tendency; not affected as much by extreme values; preferred for highly skewed data (e.g., starting salaries).

    • Mode is rarely the best measure unless the mode itself is of specific interest (e.g., modal nail length for a supplier).

  • Summary: the best measure depends on the type of descriptive information desired.
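
The sensitivity of the mean to extreme values can be seen with a short sketch using Python's standard statistics module. The salary figures below are hypothetical and chosen only to include one extreme observation.

```python
import statistics

# Hypothetical starting salaries (in thousands); the last value is extreme.
salaries = [50, 52, 55, 55, 60, 300]

mean = statistics.mean(salaries)      # balance point, pulled toward 300
median = statistics.median(salaries)  # middle value, barely affected
mode = statistics.mode(salaries)      # most frequent value

# Recompute the mean without the extreme observation for comparison.
mean_without_extreme = statistics.mean(salaries[:-1])

print(f"mean = {mean:.2f}, median = {median}, mode = {mode}")
print(f"mean without the extreme value = {mean_without_extreme:.2f}")
```

A single large salary moves the mean far above the bulk of the data, while the median and mode stay with the typical values, which is why the median is preferred for highly skewed data.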

2.4.1 Applied Exercises

  • Exercise: Highest paid engineers (Electronic Design, 2012)

    • Given population means: software engineering manager mean = 126,417; manufacturing/production engineer mean = 92,360; assume mound-shaped distributions.

    • Statements (true/false):

    • a. All software engineering managers earn exactly 126,417. (False)

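    • b. Half of manufacturing/production engineers earn less than 92,360. (False in general, since 92,360 is the mean, not the median; approximately true only if the distribution is symmetric, so that the mean and median coincide)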

    • c. A randomly selected software engineering manager will always earn more than a randomly selected manufacturing/production engineer. (False)

  • Exercise: Ammonia in car exhaust (Environmental Science & Technology, 2000)

    • Data: eight daily afternoon ammonia concentrations (ppm).

    • Tasks: compute the mean; compute the median; interpret the values.

2.3–2.4 Further Note on Central Tendency and Variability

  • Additional context from subsequent materials (e.g., 3.1 Crude oil biodegradation exercise) introduces related calculations for means, medians, and modes on specific data sets; these extend the application of central tendency concepts to real data.

2.5 Measures of Variation

  • Rationale: Central tendency alone omits information about dispersion (spread).

  • Common measures of variation:

    • Range: difference between maximum and minimum values; simple to compute, but it depends only on the two extreme observations and so says nothing about how the remaining data are spread.

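    • Variance: average squared deviation from the mean (the sample variance s^2 divides the sum of squared deviations by n - 1); expressed in the square of the original units (e.g., MPa^2 for measurements in MPa).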

    • Standard deviation: square root of the variance; same units as the original data; easiest to interpret when used with the mean.

  • Relationship with mean:

    • When combined with the mean, standard deviation provides a practical sense of spread around the center.

  • Rules for interpretation:

    • Empirical Rule (for mound-shaped distributions):

    • Approximately 68% of observations lie within \bar{y} \pm s.

    • Approximately 95% lie within \bar{y} \pm 2s.

    • Approximately 99.7% lie within \bar{y} \pm 3s.

    • Chebyshev’s Rule (for any distribution):

    • At least 1 - \frac{1}{k^2} of the observations lie within k standard deviations of the mean, for any k > 1.
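
The measures of variation and the two interpretation rules can be checked numerically. The sketch below uses Python's standard statistics module on a small hypothetical data set; proportion_within and chebyshev_lower_bound are names introduced only for this illustration. It assumes the sample (n - 1) form of the variance.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements

data_range = max(data) - min(data)
y_bar = statistics.mean(data)
s2 = statistics.variance(data)  # sample variance, divisor n - 1
s = statistics.stdev(data)      # square root of the sample variance


def proportion_within(values, k):
    """Fraction of observations within k sample standard deviations of the mean."""
    m = statistics.mean(values)
    sd = statistics.stdev(values)
    return sum(1 for y in values if abs(y - m) <= k * sd) / len(values)


def chebyshev_lower_bound(k):
    """Chebyshev's guaranteed minimum proportion within k standard deviations."""
    if k <= 1:
        raise ValueError("Chebyshev's Rule is informative only for k > 1")
    return 1 - 1 / k**2


print(f"range = {data_range}, variance = {s2:.3f}, sd = {s:.3f}")
for k in (2, 3):
    print(k, proportion_within(data, k), chebyshev_lower_bound(k))
```

The observed proportions should never fall below the Chebyshev bound, whatever the shape of the data; the Empirical Rule percentages are only approximations that apply when the distribution is mound-shaped.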

2.5.1 Applied Exercises

  • Exercise: Highest paid engineers (mean = 126,417; sd = 15,000)

    • Distributions are mound-shaped and symmetric with given sd.

    • Task: sketch the distribution showing intervals \bar{y} \pm \sigma, \bar{y} \pm 2\sigma, \bar{y} \pm 3\sigma and estimate the proportion in each interval.

  • Exercise: Ammonia in car exhaust

    • Data: eight daily afternoon ammonia concentrations (ppm), the same data set as in the 2.4.1 exercise.

    • Tasks:

    • a. Find the range.

    • b. Find the variance.

    • c. Find the standard deviation.

    • d. If the standard deviation during morning drive-time is 1.45 ppm, which time (morning or afternoon) is more variable?

  • Exercise: Bearing strength of FRP strips

    • Data: 10 FRP strip bearing strength measurements (MPa).

    • Task: use the sample data to provide an interval likely to contain the bearing strength.

2.6 Measures of Relative Standing

  • Purpose: Describe the location of an observation relative to the rest of the data.

  • Percentiles: specify the relative standing; e.g., the 84th percentile means 84% of observations are below that value.

  • Key percentiles:

    • 25th percentile = lower quartile (Q1)

    • 50th percentile = median (Q2)

    • 75th percentile = upper quartile (Q3)

  • Z-scores:

    • Definition: a z-score describes the location of an observation y relative to the mean in units of the standard deviation.

    • Population form: z =\frac{y - \mu}{\sigma}

    • Sample form: z =\frac{y - \bar{y}}{s}

  • Interpretive guidance:

    • Negative z-scores indicate observations left of the mean; positive z-scores indicate right of the mean.

    • Empirical Rule implications: most observations have |z| < 2, and almost all have |z| < 3.
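
A short sketch of these measures of relative standing, using Python's standard statistics module. The data are hypothetical, and z_score and percent_below are helper names invented for this example. Note that software packages use slightly different conventions for quartiles, so the values returned by statistics.quantiles may differ a little from a hand calculation.

```python
import statistics

data = [12, 15, 11, 18, 14, 16, 13, 17, 15, 19]  # hypothetical sample

y_bar = statistics.mean(data)
s = statistics.stdev(data)


def z_score(y, mean, sd):
    """Distance of y from the mean, in units of the standard deviation."""
    return (y - mean) / sd


def percent_below(values, y):
    """Percentage of observations strictly below y."""
    return 100 * sum(1 for v in values if v < y) / len(values)


# Three cut points: lower quartile, median, upper quartile.
q1, q2, q3 = statistics.quantiles(data, n=4)

print(f"z-score of 19: {z_score(19, y_bar, s):.2f}")
print(f"percent of observations below 18: {percent_below(data, 18)}")
print(f"quartiles: {q1}, {q2}, {q3}")
```

An observation equal to the mean has z = 0, and the middle cut point returned by statistics.quantiles with n=4 is the median.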

2.6.1 Applied Exercises

  • Exercise: Phosphorus standards in the Everglades

    • Given: 75th percentile of the total phosphorus (TP) distribution is 10 μg/L.

    • Task: interpret this percentile value and justify why it was used as a standard.

  • Exercise: Lead in drinking water (EPA Action Level = 0.015 mg/L)

    • Statement: 90th percentile of a study sample is 0.00372 mg/L.

    • Question: Are customers at risk of unhealthy lead levels? Explain.

2.7 Methods for Detecting Outliers

  • Z-score method: A value with a large |z| may be an outlier.

  • Box plot method (based on IQR):

    • Define hinges: QL (lower quartile) and QU (upper quartile).

    • Median shown inside the box.

    • Inner fences: at distances 1.5 × IQR from hinges; observations beyond inner fences flagged as potential outliers using markers (e.g., *, 0).

    • Outer fences: at distances 3 × IQR from hinges; observations beyond the outer fences are treated as highly suspect outliers and are usually plotted with a separate marker.

    • IQR-based fences are resistant to outliers, whereas z-scores are not: extreme values shift the mean and inflate the standard deviation, which shrinks the |z| of the very observations being checked.

  • Comparison:

    • Both methods provide rule-of-thumb criteria for labeling outliers.

    • Box-plot-based fences use quartiles and IQR, less sensitive to extreme values than standard deviation-based z-scores.
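
The fence calculations for the box plot method can be sketched as follows. The quartile values are hypothetical, the function names fences and classify are made up for this example, and the labels "suspect" and "highly suspect" are one common convention rather than a fixed standard.

```python
def fences(q_l, q_u):
    """Return the IQR plus the inner and outer fences for given quartiles."""
    iqr = q_u - q_l
    return {
        "iqr": iqr,
        "inner": (q_l - 1.5 * iqr, q_u + 1.5 * iqr),
        "outer": (q_l - 3 * iqr, q_u + 3 * iqr),
    }


def classify(y, q_l, q_u):
    """Label an observation by where it falls relative to the fences."""
    f = fences(q_l, q_u)
    inner_low, inner_high = f["inner"]
    outer_low, outer_high = f["outer"]
    if y < outer_low or y > outer_high:
        return "highly suspect outlier"
    if y < inner_low or y > inner_high:
        return "suspect outlier"
    return "not flagged"


# Hypothetical quartiles: QL = 10, QU = 20, so IQR = 10.
print(fences(10, 20))
for y in (15, 40, 60):
    print(y, classify(y, 10, 20))
```

Because the fences depend only on the quartiles, adding one extreme observation to a large data set leaves them essentially unchanged, which is the resistance property noted in the comparison.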

2.7.1 Applied Exercises

  • Exercise: Highest paid engineers

    • Setup: Distribution is mound-shaped and symmetric with sd = 15,000; salary claim = 180,000.

    • Task: Assess believability of the claim and explain.

  • Exercise: Barium content of clinkers

    • Given: summary statistics for 200 clinkers: QL = 115, m = 170, QU = 260.

    • Tasks:

    • a. Interpret the median m.

    • b. Interpret QL.

    • c. Interpret QU.

    • d. Compute IQR.

    • e. Find the endpoints of the inner fences for a box plot.

    • f. If no values fall beyond the inner fences, what does this imply?

  • Exercise: Zinc phosphide in sugarcane

    • Data: distribution of ZnP concentration ≈ N(mean = 2.0%, sd = 0.08%).

    • Question: If one batch contains 1.80%, does this indicate too little ZnP in production? Explain.

2.8 Supplementary Exercises

  • Exercise: Fate of scrapped tires

    • Data: categories and frequencies for tires’ fate; tasks include identifying variable, classes, class relative frequencies, pie chart, Pareto.

  • Exercise: Microbial fuel cells (MFCs)

    • Data: 54 articles categorized by investigation area; graph summarizes areas; tasks:

    • a. Identify the qualitative variable measured for each article.

    • b. Identify the type of graph.

    • c. Convert to a Pareto diagram and identify the dominant investigation area.

  • Exercise: Deep-hole drilling

    • Data: frequency histogram for drill chip lengths (50 chips).

    • Tasks:

    • a. Convert to a relative frequency histogram.

    • b. Based on the relative histogram, assess whether you would expect a drill chip of at least 190 mm.

4–6 Contextual and Practical Notes

  • Overall understanding: The materials emphasize choosing the right graphical method based on data type, the trade-offs between displaying individual observations vs. distribution summaries, and practical rules for identifying central tendency, variability, and outliers.

  • Connections to prior/statistical principles:

    • Concepts of consistency between graphical descriptions and numerical summaries.

    • Foundations for hypothesis testing and quality control (e.g., Pareto diagrams, outlier rules).

  • Real-world relevance:

    • Qualitative data handling in safety/hazards contexts, manufacturing quality control, and program evaluation.

    • Quantitative data analysis foundations applied to engineering, environmental science, and industrial operations.

  • Ethical/philosophical/practical implications:

    • The choice of descriptive method affects interpretation and decision-making (e.g., reporting only the mean for highly skewed data, or choosing histogram class intervals that hide the shape of the distribution).

    • Outlier handling and data integrity considerations in reporting results.

Key Formulas and Concepts (Summary)

  • Relative frequency (proportion) for category i:
    p_i = \frac{n_i}{n}

  • Mean (sample):
    \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i

  • Z-score (population):
    z = \frac{y - \mu}{\sigma}

  • Z-score (sample):
    z = \frac{y - \bar{y}}{s}

  • Interquartile range:
    IQR = QU - QL

  • Inner fences (boxplot):
    QL - 1.5 \cdot IQR, \, QU + 1.5 \cdot IQR

  • Outer fences (boxplot):
    QL - 3 \cdot IQR, \, QU + 3 \cdot IQR

  • Percentiles:

    • 25th = Q1, 50th = Q2 = median, 75th = Q3

  • Empirical Rule (approximate for mound-shaped distributions):

    • Within 1 standard deviation of the mean: ~68%; within 2 standard deviations: ~95%; within 3 standard deviations: ~99.7%

  • Chebyshev’s Rule (general):

    • At least 1 - \frac{1}{k^2} of observations lie within k\sigma of the mean, for any distribution and k > 1

Note: Throughout, observations are denoted y, and a sample of n observations is written y1, y2, …, yn. Population parameters are typically denoted by Greek letters (e.g., μ, σ).