Basic Data Visualization Notes (L03)

L03.1 Boxplots for Outliers and Z Scores

Purpose: identify unusual observations in a numeric data set using boxplots and Z-scores.
Boxplot method for outliers:
- Outlier fences are determined from the IQR (interquartile range).
- IQR = Q3 − Q1.
- Upper fence = Q3 + 1.5 · IQR.
- Lower fence = Q1 − 1.5 · IQR.
- Observations outside these fences are typically labeled as outliers.
How to compute and interpret in practice:
- Boxplots visually show the spread of the middle 50% of data (between Q1 and Q3) and possible outliers outside the fences.
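The fence rule above can be sketched in Python. This is a minimal illustration, not Minitab's procedure: the `cost` values are invented, and the quartiles come from the standard library's `statistics.quantiles` with its default method, so software that uses a different quartile convention may report slightly different fences.

```python
import statistics

def iqr_fences(values):
    """Return (lower_fence, upper_fence) using the 1.5 * IQR rule."""
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def boxplot_outliers(values):
    """Return the observations that fall outside the fences."""
    lower, upper = iqr_fences(values)
    return [x for x in values if x < lower or x > upper]

# Hypothetical data, invented for illustration only.
cost = [12, 15, 17, 18, 20, 21, 22, 24, 25, 60]
lower, upper = iqr_fences(cost)
print(f"lower fence = {lower}, upper fence = {upper}")
print("outliers:", boxplot_outliers(cost))
```

Only the value 60 lies beyond the upper fence here; every other observation sits inside the fences and would be drawn as part of the box or whiskers.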
Z-score method for outliers:
- Compute standardized values (Z-scores) to assess how far observations are from the mean in units of standard deviation.
- Observations with a Z-score outside ±3 are considered outliers.
Z-score calculation (two common forms):
- Population/overall standardization: $Z = \frac{X - \mu}{\sigma}$
- Sample standardized value: $Z = \frac{X - \bar{X}}{s}$
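As a rough sketch of the sample form, the following Python standardizes a list with the sample mean and sample standard deviation and flags values with |Z| > 3. The function names and the `sample` data are hypothetical and chosen only for illustration.

```python
import statistics

def z_scores(values):
    """Standardize values with the sample mean and sample standard deviation."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    return [(x - mean) / s for x in values]

def z_outliers(values, cutoff=3.0):
    """Return observations whose |Z| exceeds the cutoff."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > cutoff]

# Hypothetical data: twenty ordinary values plus one extreme value.
sample = list(range(1, 21)) + [200]
print("flagged:", z_outliers(sample))
```

Standardized values always have mean 0 and standard deviation 1, which is what makes a fixed cutoff such as ±3 comparable across variables measured in different units.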
Minitab steps for boxplots and outliers:
- Boxplot: Graph → Boxplot → One Y/Simple → Select Vars → OK.
- Z-score (outliers): one possible path is Calc → Standardize → select the variable and where to store the results; another is the Descriptive Statistics path: Stat → Basic Statistics → Display Descriptive Statistics → Select Variable → Statistics → Make Selections, then look for Z-scores outside ±3.
Practical notes:
- Boxplot fences help distinguish typical spread from extreme values; outliers can affect measures of central tendency and spread.
- Z-scores depend on the mean and standard deviation; interpreting distance from the mean (for example, the ±3 cutoff) assumes a roughly normal distribution.

L03.2 Skewness and Kurtosis

Skewness:
- Symmetric distribution: skewness ≈ 0.
- Left-skewed: skewness < 0.
- Right-skewed: skewness > 0.
Kurtosis (peakedness/heaviness of tails):
- Normal distribution: kurtosis ≈ 0 (excess kurtosis).
- Peakier, heavier-tailed distribution: kurtosis > 0.
- Flatter, lighter-tailed distribution: kurtosis < 0.
How to compute in Minitab:
- Stat → Basic Statistics → Display Descriptive Statistics → Select Variable → Statistics → Make Selections.
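Outside Minitab, the two measures can be sketched from central moments. This is a simplified, moment-based version; statistical packages often apply small-sample adjustments, so their reported values may differ somewhat from these. The data lists are hypothetical.

```python
def central_moment(values, k):
    """k-th central moment: the average of (x - mean) ** k."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** k for x in values) / n

def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2 ** 2 - 3 (about 0 for normal data)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

# Hypothetical data sets, invented for illustration only.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]
print("skewness:", skewness(symmetric), skewness(right_skewed))
print("excess kurtosis:", excess_kurtosis(symmetric))
```

The symmetric list gives a skewness of zero, while the list with one large value on the right gives a positive skewness, matching the sign convention above.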
Significance and interpretation:
- Skewness indicates asymmetry around the mean and can affect parameter estimates and tests that assume normality.
- Kurtosis describes tail heaviness and peakedness relative to a normal distribution; it influences tail risk and how many outliers to expect.
Notes on use:
- Skewness and kurtosis are descriptive measures; use them alongside histograms and normality tests for a fuller picture.

L03.3 Visualizing One Numeric Variable: Histogram

Purpose: visualize the distribution of a single numeric variable.
Histogram in Minitab:
- Graph → Histogram → Simple → OK.
What the histogram communicates:
- Shape of the distribution (symmetric, skewed, multimodal).
- Central tendency via the location of the bulk of the data, and spread via how widely the bars extend along the horizontal axis.
Practical considerations:
- Choice of bin width affects the apparent shape; too few bins can mask features, too many bins can introduce noise.
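The effect of bin width can be seen with a small text histogram. The sketch below uses hypothetical data with two clusters; the helper names are illustrative and not part of any library.

```python
import math
from collections import Counter

def bin_counts(values, bin_width):
    """Count observations per bin; keys are the left edge of each bin.

    Empty bins between the smallest and largest value are kept so gaps show.
    """
    counts = Counter(math.floor(x / bin_width) for x in values)
    first, last = min(counts), max(counts)
    return {i * bin_width: counts[i] for i in range(first, last + 1)}

def text_histogram(values, bin_width):
    """Render one row of '*' per bin."""
    return "\n".join(
        f"[{left}, {left + bin_width}): " + "*" * n
        for left, n in bin_counts(values, bin_width).items()
    )

# Hypothetical data with two clusters (around 2 and around 9).
scores = [1, 2, 2, 3, 8, 9, 9]
print(text_histogram(scores, bin_width=1))   # narrow bins: gap between clusters is visible
print(text_histogram(scores, bin_width=10))  # one wide bin: both clusters merge
```

With a width of 1 the two clusters and the empty stretch between them are visible; with a width of 10 everything falls into a single bar and the two-cluster shape disappears.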
Connections:
- Complements boxplots and descriptive statistics to understand distribution.

L03.4 Scatter Plot: The Relationship Between Two Numeric Variables

Purpose: assess the relationship between two numeric variables.
Example data: Restaurant2.mtw; examine whether “Popular Index” changes with “Cost.”
Scatter plot steps in Minitab:
- Graph → Scatter Plot → Simple → OK.
- Optional: Graph → Scatter Plot → With Regression → OK to add a regression line.
Interpretation questions:
- Do the two variables move together (positive relationship) or in opposite directions (negative relationship)?
- Is the relationship strong, moderate, or weak (judged from how tightly the points follow a pattern and, if a regression line is fitted, from R-squared)?
Practical implications:
- Scatter plots reveal trends, clusters, or potential nonlinear patterns that may warrant further modeling.
Notes:
- Regression line (if added) summarizes linear association; correlation coefficient (r) quantifies strength/direction (not shown in transcript but commonly used).
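To make the two summaries concrete, here is a minimal sketch that computes the correlation coefficient r and a least-squares line by hand. The `cost` and `popularity` values are invented and are not taken from Restaurant2.mtw. For a simple linear regression with one predictor, R-squared equals r squared.

```python
import math

def _sums(x, y):
    """Return (mean_x, mean_y, Sxy, Sxx, Syy) for paired data."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return mx, my, sxy, sxx, syy

def pearson_r(x, y):
    """Correlation coefficient: sign gives direction, magnitude gives strength."""
    _, _, sxy, sxx, syy = _sums(x, y)
    return sxy / math.sqrt(sxx * syy)

def least_squares_line(x, y):
    """Return (intercept, slope) of the fitted line y = intercept + slope * x."""
    mx, my, sxy, sxx, _ = _sums(x, y)
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical data, invented for illustration only.
cost = [20, 30, 40, 50, 60]
popularity = [55, 60, 58, 70, 72]
r = pearson_r(cost, popularity)
intercept, slope = least_squares_line(cost, popularity)
print(f"r = {r:.3f}, R-squared = {r * r:.3f}")
print(f"fitted line: popularity = {intercept:.2f} + {slope:.2f} * cost")
```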

L03.5 Visualizing Categorical Data: One Categorical Variable

Visualizing a single categorical variable involves understanding relative frequencies.
Relative frequency table:
- Stat → Tables → Tally Individual Variables…
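A tally of this kind can be sketched in a few lines of Python. The `cuisine` list is hypothetical, and the row of `#` characters is only a rough text stand-in for a bar chart of the same counts.

```python
from collections import Counter

def tally(values):
    """Return {category: (count, relative_frequency)}, most frequent first."""
    counts = Counter(values)
    n = len(values)
    return {cat: (c, c / n) for cat, c in counts.most_common()}

# Hypothetical values, invented for illustration only.
cuisine = ["Italian", "Chinese", "Italian", "Mexican",
           "Italian", "Chinese", "French", "Italian"]

for cat, (count, share) in tally(cuisine).items():
    # Each row: category, count, relative frequency, and a text bar.
    print(f"{cat:<10}{count:>3}{share:>8.1%}  " + "#" * count)
```

The relative frequencies sum to 1, which is a quick check that every observation was counted exactly once.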
Bar charts:
- Graph options: Bar Chart → Simple (or Other types such as Cluster, Stack).
- Bars represent counts of unique values (frequencies) for each category.
- Variables listed in the transcript's example worksheet: Restaurant, Food, Decor, Service, Cost, Popularity Index, Cuisine, Region, Z Cost, etc. (these are column names, not categories of a single variable).
Bar Chart: Simple
- Used to display counts of each category for a single categorical variable.
- Data representation: height of each bar corresponds to the frequency or count.
Key features:
- Helpful for comparing category frequencies at a glance.
- Can switch to alternative layouts (Cluster, Stack) to compare multiple variables or groupings within the same plot.
Textual notes from example:
- Variable labels included: Cuisine, Cost, Region, etc.
- Example data shows counts like 0, 3, 8, etc. for category frequencies.

L03.5 Visualizing Categorical Data: Pie Chart

Pie chart aims to show proportional shares of a categorical variable.
Pie chart setup in Minitab:
- Graph → Pie Chart → Pie Chart.
- Data source options in the dialog: “Chart counts of unique values” or “Chart values from a table.”
- Categorical variable highlighted: Cuisine.
- Pie Options…, Labels…, Data Options… to adjust appearance and labeling.
Example from transcript:
- Categories listed under Cuisine: American (New), Chinese, French, Indian, Italian, Japanese, Mexican, etc.
- The chart displays the proportion of each cuisine category among restaurants.
Considerations:
- Pie charts work best with a small number of categories that each hold a meaningful share; many thin slices are hard to compare.
- Use relative frequencies for intuitive understanding of shares.

L03.6 A Contingency Table: Visualizing Two Categorical Variables

Purpose: organize and summarize the relationship between two categorical variables.
Contingency table concept:
- Displays counts for combinations of two categorical factors (e.g., Cuisine × Region).
- Enables analysis of dependence or association between the variables.
Chi-square based approach (from transcript):
- Stat → Tables → Cross Tabulation and Chi-Square.
Key formula (general context):
- Expected counts: $E_{ij} = \frac{R_i C_j}{N}$, where $R_i$ is the row total for row $i$, $C_j$ is the column total for column $j$, and $N$ is the grand total.
- Chi-square statistic: $\chi^2 = \sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$, where $O_{ij}$ are the observed counts.
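The expected-count and chi-square formulas can be checked on a small table. The sketch below assumes the observed counts are given as a list of rows; the 2 × 2 table is hypothetical, and the code computes only the statistic, not a p-value.

```python
def expected_counts(observed):
    """Expected count for each cell: (row total * column total) / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square(observed):
    """Sum of (observed - expected) ** 2 / expected over all cells."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

# Hypothetical 2 x 2 table of counts, invented for illustration only.
observed = [[10, 20],
            [20, 10]]
print("expected:", expected_counts(observed))
print("chi-square:", chi_square(observed))
```

Every row and column total is 30 out of 60, so each expected count is 15, and each cell contributes (5)² / 15 to the statistic.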
Practical notes:
- Cross-tabulations help identify patterns such as whether certain cuisines are more common in particular regions.
- Chi-square test assesses whether there is a statistically significant association between the two categorical variables (not detailed in transcript, but implied by the method name).
Connections to foundational principles:
- Descriptive statistics summarize data; visualizations (histograms, scatter plots, bar charts, pie charts, contingency tables) convey distribution, relationships, and composition.
- Basic data visualization practices rely on understanding data types (numeric vs categorical), distribution shape, and the appropriate summary metrics.
Practical tips and cautions:
- When interpreting skewness/kurtosis, consider sample size and the presence of outliers which can distort these measures.
- For histograms, choose bin widths thoughtfully to avoid masking or exaggerating features.
- In bar charts and pie charts, ensure category labels are clear and that counts sum to the sample size (or shares to 100%) to avoid misinterpretation.
- In contingency tables, ensure adequate expected counts for chi-square validity; consider alternative tests if counts are low.

Summary of key formulas and commands

Boxplot outlier fences:
- IQR = Q3 − Q1
- Upper fence = Q3 + 1.5 · IQR
- Lower fence = Q1 − 1.5 · IQR
Z-scores:
- $Z = \frac{X - \mu}{\sigma}$ or $Z = \frac{X - \bar{X}}{s}$
- Outliers: |Z| > 3
Skewness interpretation:
- Symmetric: skewness ≈ 0
- Left skewed: skewness < 0
- Right skewed: skewness > 0
Kurtosis interpretation (excess):
- Normal: kurtosis ≈ 0
- Peakier / heavier tails: kurtosis > 0
- Flatter / lighter tails: kurtosis < 0
Histograms: no fixed equation; visualize distribution shape.
Scatter plots: assess pattern and strength of linear association; regression line optional.
Contingency table and chi-square:
- Expected counts: $E_{ij} = \frac{R_i C_j}{N}$
- Chi-square: $\chi^2 = \sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$