Basic Data Visualization Notes (L03)

L03.1 Boxplots for Outliers and Z Scores

  • Purpose: identify unusual observations in a numeric data set using boxplots and Z-scores.
  • Boxplot method for outliers:
    • Outlier fences are determined from the IQR (interquartile range).
    • IQR = Q3 − Q1.
    • Upper fence = Q3 + 1.5 · IQR.
    • Lower fence = Q1 − 1.5 · IQR.
    • Observations outside these fences are typically labeled as outliers.
  • How to compute and interpret in practice:
    • Boxplots visually show the spread of the middle 50% of data (between Q1 and Q3) and possible outliers outside the fences.
  • Z-score method for outliers:
    • Compute standardized values (Z-scores) to assess how far observations are from the mean in units of standard deviation.
    • Observations with a Z-score outside ±3 are considered outliers.
  • Z-score calculation (two common forms):
    • Population/overall standardization: Z = \frac{X - \mu}{\sigma}
    • Sample standardized value: Z = \frac{X - \bar{X}}{s}
  • Minitab steps for boxplots and outliers:
    • Boxplot: Graph → Boxplot → One Y/Simple → Select Vars → OK.
    • Z-scores (outliers): the Minitab path may be Calc → Standardize → Select Variable and location; or, via Descriptive Statistics, Stat → Basic Statistics → Display Descriptive Statistics → Select Variable → Statistics → Make Selections. In either case, look for Z-scores outside ±3.
  • Practical notes:
    • Boxplot fences help distinguish typical spread from extreme values; outliers can affect measures of central tendency and spread.
    • Z-scores depend on the mean and standard deviation, which are themselves pulled by extreme values; the ±3 cutoff is easiest to interpret when the data are roughly normal.
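
The two outlier rules in this section can be sketched in a few lines of Python. This is an illustration only, not the Minitab procedure: the data values are made up, the helper names `iqr_fences` and `z_scores` are hypothetical, and the quartiles come from Python's `statistics.quantiles`, whose convention may give slightly different Q1 and Q3 than Minitab.

```python
import statistics

# Hypothetical data, made up for illustration only.
costs = [22, 25, 27, 30, 31, 33, 35, 38, 40, 95]

def iqr_fences(values):
    """Return (lower_fence, upper_fence) using the 1.5 * IQR rule."""
    # quantiles(n=4) returns [Q1, median, Q3]; quartile conventions
    # differ between packages, so these may not match Minitab exactly.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def z_scores(values):
    """Standardize with the sample mean and sample standard deviation."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return [(x - mean) / s for x in values]

lower, upper = iqr_fences(costs)
boxplot_outliers = [x for x in costs if x < lower or x > upper]

z = z_scores(costs)
z_outliers = [x for x, zi in zip(costs, z) if abs(zi) > 3]

print("fences:", lower, upper)
print("outside the fences:", boxplot_outliers)
print("|Z| > 3:", z_outliers)
```

With this made-up sample the fence rule flags 95 while the ±3 rule flags nothing, because the extreme value inflates the mean and standard deviation it is compared against. In small samples the two methods can therefore disagree.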

L03.2 Skewness and Kurtosis

  • Skewness:
    • Symmetric distribution: skewness ≈ 0.
    • Left-skewed: skewness < 0.
    • Right-skewed: skewness > 0.
  • Kurtosis (peakedness/heaviness of tails):
    • Normal distribution: kurtosis ≈ 0 (excess kurtosis).
    • Peakier distribution: kurtosis > 0.
    • Flatter (less peaked) distribution: kurtosis < 0.
  • How to compute in Minitab:
    • Stat → Basic Statistics → Display Descriptive Statistics → Select Variable → Statistics → Make Selections.
  • Significance and interpretation:
    • Skewness indicates asymmetry around the mean and can affect parameter estimates and tests that assume normality.
    • Kurtosis describes tail heaviness and peakedness relative to a normal distribution; it influences tail risk and how many outliers to expect.
  • Notes on use:
    • Skewness and kurtosis are descriptive measures; use them alongside histograms and normality tests for a fuller picture.
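
As a rough sketch of what these two statistics measure, the snippet below computes moment-based skewness and excess kurtosis in plain Python. The function names and data are hypothetical, and the formulas are the simple unadjusted versions; statistical packages often apply small-sample adjustments, so the numbers may differ somewhat from what Minitab reports.

```python
import statistics

def central_moment(values, k):
    """k-th central moment: the average of (x - mean)^k."""
    mean = statistics.fmean(values)
    return sum((x - mean) ** k for x in values) / len(values)

def skewness(values):
    """Unadjusted moment-based skewness: m3 / m2^(3/2)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Unadjusted excess kurtosis: m4 / m2^2 - 3 (normal gives about 0)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

# Hypothetical samples, made up for illustration only.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]
left_skewed = [-x for x in right_skewed]  # mirror image of the sample above

for name, data in [("symmetric", symmetric),
                   ("right-skewed", right_skewed),
                   ("left-skewed", left_skewed)]:
    print(name, round(skewness(data), 3), round(excess_kurtosis(data), 3))
```

The signs follow the rules listed above: the mirrored sample has the same magnitude of skewness with the opposite sign, and the evenly spread sample has negative excess kurtosis.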

L03.3 Visualizing One Numeric Variable: Histogram

  • Purpose: visualize the distribution of a single numeric variable.
  • Histogram in Minitab:
    • Graph → Histogram → Simple → OK.
  • What the histogram communicates:
    • Shape of the distribution (symmetric, skewed, multimodal).
    • Central tendency via the location of the bulk of the data; spread via how widely the bars extend along the horizontal axis.
  • Practical considerations:
    • Choice of bin width affects the apparent shape; too few bins can mask features, too many bins can introduce noise.
  • Connections:
    • Complements boxplots and descriptive statistics to understand distribution.
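
To make the bin-width point concrete, here is a small sketch that counts the same made-up values with two different bin widths and prints a text histogram. The function name `histogram_counts`, the data, and the choice of left-closed bins starting at 20 are all assumptions for illustration; Minitab chooses its own bins.

```python
def histogram_counts(values, bin_width, start):
    """Count values in left-closed bins [start + i*w, start + (i+1)*w)."""
    n_bins = int((max(values) - start) // bin_width) + 1
    counts = [0] * n_bins
    for x in values:
        counts[int((x - start) // bin_width)] += 1
    return counts

def print_histogram(values, bin_width, start):
    """Print one row of asterisks per bin."""
    for i, c in enumerate(histogram_counts(values, bin_width, start)):
        left = start + i * bin_width
        print(f"{left:>3}-{left + bin_width:<3} {'*' * c}")

# Hypothetical data, made up for illustration only.
costs = [22, 25, 27, 30, 31, 33, 35, 38, 40, 95]

print("bin width 10:")
print_histogram(costs, 10, 20)
print("bin width 40:")
print_histogram(costs, 40, 20)
```

With a width of 10 the isolated value near 95 and the gap before it are visible; with a width of 40 the data collapse into two bars and that detail is lost.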

L03.4 Scatter Plot: The Relationship Between Two Numeric Variables

  • Purpose: assess the relationship between two numeric variables.
  • Example data: Restaurant2.mtw; examine whether “Popular Index” changes with “Cost.”
  • Scatter plot steps in Minitab:
    • Graph → Scatter Plot → Simple → OK.
    • Optional: Graph → Scatter Plot → With Regression → OK to add a regression line.
  • Interpretation questions:
    • Do the two variables move together (positive relationship) or in opposite directions (negative relationship)?
    • Is the relationship strong, moderate, or weak (judged from how tightly the points follow a pattern and, if a regression line is fitted, from R-squared)?
  • Practical implications:
    • Scatter plots reveal trends, clusters, or potential nonlinear patterns that may warrant further modeling.
  • Notes:
    • Regression line (if added) summarizes linear association; correlation coefficient (r) quantifies strength/direction (not shown in transcript but commonly used).
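
The correlation coefficient and regression line mentioned in the notes can be computed by hand as a sketch. The numbers below are invented and are not taken from Restaurant2.mtw; the function names are hypothetical. For a simple linear regression with one predictor, R-squared equals r squared.

```python
import math

# Hypothetical values, made up for illustration (not from Restaurant2.mtw).
cost = [20, 30, 40, 50, 60]
popularity = [30, 45, 50, 70, 80]

def sums_of_squares(x, y):
    """Return (mean_x, mean_y, Sxx, Syy, Sxy) for paired data."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return mx, my, sxx, syy, sxy

def pearson_r(x, y):
    """Pearson correlation: Sxy / sqrt(Sxx * Syy)."""
    _, _, sxx, syy, sxy = sums_of_squares(x, y)
    return sxy / math.sqrt(sxx * syy)

def least_squares_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    mx, my, sxx, _, sxy = sums_of_squares(x, y)
    slope = sxy / sxx
    return my - slope * mx, slope

r = pearson_r(cost, popularity)
intercept, slope = least_squares_line(cost, popularity)
print(f"r = {r:.3f}, R-squared = {r * r:.3f}")
print(f"fitted line: popularity = {intercept:.2f} + {slope:.2f} * cost")
```

A positive slope and an r close to 1 correspond to points that rise together and lie near a straight line; a value of r near 0 would indicate little linear association, though a nonlinear pattern could still be present.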

L03.5 Visualizing Categorical Data: One Categorical Variable

  • Visualizing a single categorical variable involves understanding relative frequencies.
  • Relative frequency table:
    • Stat → Tables → Tally Individual Variables…
  • Bar charts:
    • Graph options: Bar Chart → Simple (or Other types such as Cluster, Stack).
    • Bars represent counts of unique values (frequencies) for each category.
    • Names listed in the transcript's example (these appear to be worksheet variables rather than categories of a single variable): Restaurant, Food, Decor, Service, Cost, Popularity Index, Cuisine, Region, Z Cost, etc.
  • Bar Chart: Simple
    • Used to display counts of each category for a single categorical variable.
    • Data representation: height of each bar corresponds to the frequency or count.
  • Key features:
    • Helpful for comparing category frequencies at a glance.
    • Can switch to alternative layouts (Cluster, Stack) to compare multiple variables or groupings within the same plot.
  • Textual notes from example:
    • Variable labels included: Cuisine, Cost, Region, etc.
    • Example data shows counts like 0, 3, 8, etc. for category frequencies.
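
A tally of counts and relative frequencies, which is what the bar heights represent, can be sketched with Python's `collections.Counter`. The cuisine values below are invented for illustration and the function name `relative_frequencies` is hypothetical.

```python
from collections import Counter

# Hypothetical category values, made up for illustration only.
cuisine = ["Italian", "Chinese", "Italian", "Mexican",
           "Italian", "Chinese", "French", "Italian"]

def relative_frequencies(values):
    """Map each category to (count, share of total), most common first."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, n / total) for cat, n in counts.most_common()}

table = relative_frequencies(cuisine)
for cat, (n, share) in table.items():
    # One '#' per observation gives a simple text bar chart.
    print(f"{cat:<8} {n:>2} {share:>6.1%} {'#' * n}")
```

The counts sum to the number of observations and the shares sum to 1, which is a quick check that no category was dropped.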

L03.5 Visualizing Categorical Data: Pie Chart

  • Pie chart aims to show proportional shares of a categorical variable.
  • Pie chart setup in Minitab:
    • Graph → Pie Chart → Pie Chart.
    • Chart counts of unique values; Chart values from a table.
    • Categorical variable highlighted: Cuisine.
    • Pie Options…, Labels…, Data Options… to adjust appearance and labeling.
  • Example from transcript:
    • Categories listed under Cuisine: American (New), Chinese, French, Indian, Italian, Japanese, Mexican, etc.
    • The chart displays the proportion of each cuisine category among restaurants.
  • Considerations:
    • Pie charts are most effective with a small number of categories that have meaningful shares; many thin slices are hard to compare.
    • Use relative frequencies for intuitive understanding of shares.

L03.6 A Contingency Table: Visualizing Two Categorical Variables

  • Purpose: organize and summarize the relationship between two categorical variables.

  • Contingency table concept:

    • Displays counts for combinations of two categorical factors (e.g., Cuisine × Region).
    • Enables analysis of dependence or association between the variables.
  • Chi-square based approach (from transcript):

    • Stat → Tables → Cross Tabulation and Chi-Square.
  • Key formula (general context):

    • Expected counts: E_{ij} = \frac{R_i C_j}{N}, where R_i is the row total for row i, C_j is the column total for column j, and N is the grand total.
    • Chi-square statistic: \chi^2 = \sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, where O_{ij} are the observed counts.
  • Practical notes:

    • Cross-tabulations help identify patterns such as whether certain cuisines are more common in particular regions.
    • Chi-square test assesses whether there is a statistically significant association between the two categorical variables (not detailed in transcript, but implied by the method name).
  • Connections to foundational principles:

    • Descriptive statistics summarize data; visualizations (histograms, scatter plots, bar charts, pie charts, contingency tables) convey distribution, relationships, and composition.
    • Basic data visualization practices rely on understanding data types (numeric vs categorical), distribution shape, and the appropriate summary metrics.
  • Practical tips and cautions:

    • When interpreting skewness/kurtosis, consider sample size and the presence of outliers which can distort these measures.
    • For histograms, choose bin widths thoughtfully to avoid masking or exaggerating features.
    • In bar charts and pie charts, ensure category labels are clear and that counts sum to the sample size (or shares to 100%) to avoid misinterpretation.
    • In contingency tables, ensure adequate expected counts for chi-square validity; consider alternative tests if counts are low.
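
The expected-count and chi-square formulas from this section can be worked through on a small table. The 2 × 2 counts below are invented for illustration, and the function names are hypothetical; the sketch stops at the statistic and its degrees of freedom, (rows − 1)(columns − 1), and does not compute a p-value.

```python
# Hypothetical 2 x 2 table of observed counts, made up for illustration.
# Rows and columns stand for the levels of two categorical variables.
observed = [
    [12, 8],
    [5, 15],
]

def expected_counts(table):
    """E_ij = (row total i) * (column total j) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square(table):
    """Sum over all cells of (O_ij - E_ij)^2 / E_ij."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )

expected = expected_counts(observed)
stat = chi_square(observed)
df = (len(observed) - 1) * (len(observed[0]) - 1)

print("expected counts:", expected)
print(f"chi-square = {stat:.3f} with {df} degree(s) of freedom")
```

Printing the expected counts alongside the statistic makes it easy to check the caution above about low expected counts before relying on the test.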

Summary of key formulas and commands

  • Boxplot outlier fences:
    • IQR = Q3 − Q1
    • Upper fence = Q3 + 1.5 · IQR
    • Lower fence = Q1 − 1.5 · IQR
  • Z-scores:
    • Z = \frac{X - \mu}{\sigma} or Z = \frac{X - \bar{X}}{s}
    • Outliers: |Z| > 3
  • Skewness interpretation:
    • Symmetric: skewness ≈ 0
    • Left skewed: skewness < 0
    • Right skewed: skewness > 0
  • Kurtosis interpretation (excess):
    • Normal: kurtosis ≈ 0
    • Peakier: kurtosis > 0
    • Flatter: kurtosis < 0
  • Histograms: no fixed equation; visualize distribution shape.
  • Scatter plots: assess pattern and strength of linear association; regression line optional.
  • Contingency table and chi-square:
    • Expected counts: E_{ij} = \frac{R_i C_j}{N}
    • Chi-square: \chi^2 = \sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}}