Basic Data Visualization Notes (L03)
L03.1 Boxplots for Outliers and Z Scores
- Purpose: identify unusual observations in a numeric data set using boxplots and Z-scores.
- Boxplot method for outliers:
- Outlier fences are determined from the IQR (interquartile range).
- IQR = Q3 − Q1.
- Upper fence = Q3 + 1.5 · IQR.
- Lower fence = Q1 − 1.5 · IQR.
- Observations outside these fences are typically labeled as outliers.
- How to compute and interpret in practice:
- Boxplots visually show the spread of the middle 50% of data (between Q1 and Q3) and possible outliers outside the fences.
- Z-score method for outliers:
- Compute standardized values (Z-scores) to assess how far observations are from the mean in units of standard deviation.
- Observations with a Z-score outside ±3 are considered outliers.
- Z-score calculation (two common forms):
- Population/overall standardization: Z = \frac{X - \mu}{\sigma}
- Sample standardized value: Z = \frac{X - \bar{X}}{s}
- Minitab steps for boxplots and outliers:
- Boxplot: Graph → Boxplot → One Y/Simple → Select Vars → OK.
- Z-score (outliers): one path is Calc → Standardize → select the variable and a storage location, then look for stored values outside ±3. Alternatively, Stat → Basic Statistics → Display Descriptive Statistics → Select Variable → Statistics → Make Selections gives the mean and standard deviation needed to compute Z-scores and check for values outside ±3.
- Practical notes:
- Boxplot fences help distinguish typical spread from extreme values; outliers can affect measures of central tendency and spread.
- Z-scores depend on the mean and standard deviation, both of which are themselves pulled by extreme values; interpreting distance from the mean with the ±3 cutoff assumes a roughly normal distribution.
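The fence and Z-score rules above can be sketched in Python as follows. This is a minimal illustration rather than the Minitab procedure: the `costs` values are made up, the function names are hypothetical, and quartile conventions differ between packages, so the fences may not match Minitab's output exactly.

```python
import statistics

def iqr_fences(values):
    """Return (lower_fence, upper_fence) using the 1.5 * IQR rule."""
    # statistics.quantiles with n=4 returns the three cut points [Q1, median, Q3].
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def boxplot_outliers(values):
    """Return the observations that fall outside the fences."""
    lower, upper = iqr_fences(values)
    return [x for x in values if x < lower or x > upper]

def z_scores(values):
    """Standardize with the sample mean and sample standard deviation s."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return [(x - mean) / s for x in values]

def z_outliers(values, cutoff=3.0):
    """Return the observations whose |Z| exceeds the cutoff."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > cutoff]

# Made-up sample for illustration only (not taken from the course data).
costs = [22, 25, 27, 28, 30, 31, 33, 35, 36, 38, 40, 95]

print("Fences:", iqr_fences(costs))
print("Boxplot outliers:", boxplot_outliers(costs))
print("Z-score outliers:", z_outliers(costs))
```

One design point worth noting: an extreme value inflates s and therefore shrinks its own Z-score. With the sample standard deviation, the largest possible |Z| in a sample of size n is (n − 1)/√n, so the |Z| > 3 rule cannot flag anything when n ≤ 10, while the boxplot fences still can.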
L03.2 Skewness and Kurtosis
- Skewness:
- Symmetric distribution: skewness ≈ 0.
- Left-skewed: skewness < 0.
- Right-skewed: skewness > 0.
- Kurtosis (peakedness/heaviness of tails):
- Normal distribution: kurtosis ≈ 0 (excess kurtosis).
- Peakier distribution: kurtosis > 0.
- Flatter (less peaked) distribution: kurtosis < 0.
- How to compute in Minitab:
- Stat → Basic Statistics → Display Descriptive Statistics → Select Variable → Statistics → Make Selections.
- Significance and interpretation:
- Skewness indicates asymmetry around the mean and can affect parameter estimates and tests that assume normality.
- Kurtosis describes tail heaviness and peakedness relative to a normal distribution; heavier tails imply greater tail risk and more frequent extreme values.
- Notes on use:
- Skewness and kurtosis are descriptive measures; use them alongside histograms and normality tests for a fuller picture.
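The sign conventions above can be checked numerically. The sketch below uses the bias-adjusted sample formulas for skewness and excess kurtosis that many statistics packages report; that Minitab uses exactly these formulas is an assumption here, and the function names and data are hypothetical.

```python
import statistics

def sample_skewness(values):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum(z^3)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 observations")
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    total = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * total

def sample_excess_kurtosis(values):
    """Adjusted sample excess kurtosis (a normal distribution gives about 0)."""
    n = len(values)
    if n < 4:
        raise ValueError("need at least 4 observations")
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    total = sum(((x - mean) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * total - correction

# Made-up samples for illustration only.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]
left_skewed = [-x for x in right_skewed]

print("symmetric skewness:", sample_skewness(symmetric))
print("right-skewed skewness:", sample_skewness(right_skewed))
print("left-skewed skewness:", sample_skewness(left_skewed))
print("flat sample excess kurtosis:", sample_excess_kurtosis(symmetric))
```

The evenly spaced sample has no long tail on either side, so its skewness is zero and its excess kurtosis is negative (flatter than normal). Both statistics are unstable in small samples, so treat values from a handful of observations as rough indications only.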
L03.3 Visualizing One Numeric Variable: Histogram
- Purpose: visualize the distribution of a single numeric variable.
- Histogram in Minitab:
- Graph → Histogram → Simple → OK.
- What the histogram communicates:
- Shape of the distribution (symmetric, skewed, multimodal).
- Central tendency via the location of the bulk of the data (the tallest bars), and spread via the horizontal range the bars cover.
- Practical considerations:
- Choice of bin width affects the apparent shape; too few bins can mask features, too many bins can introduce noise.
- Connections:
- Complements boxplots and descriptive statistics to understand distribution.
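To see how the number of bins changes the apparent shape, here is a small sketch that counts observations per equal-width bin and prints a text histogram. The data are made up, and `bin_counts` is a hypothetical helper, not a Minitab feature; Minitab chooses its own default binning.

```python
def bin_counts(values, n_bins):
    """Count observations in n_bins equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; cannot form bins")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in values:
        # The maximum value would land one past the end, so clamp it
        # into the last bin.
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

def print_histogram(values, n_bins):
    """Print one row of '#' marks per bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    for i, count in enumerate(bin_counts(values, n_bins)):
        left = lo + i * width
        print(f"{left:6.1f} to {left + width:6.1f} | {'#' * count}")

# Made-up sample with two separate groups of values.
data = [1, 2, 2, 3, 3, 3, 4, 8, 9, 9, 10, 10, 10, 11]

print("2 bins:")
print_histogram(data, 2)
print("5 bins:")
print_histogram(data, 5)
```

With 2 bins the two groups look like a flat distribution; with 5 bins an empty bin appears between them and the two clusters become visible. This is the masking effect of too few bins described above.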
L03.4 Scatter Plot: The Relationship Between Two Numeric Variables
- Purpose: assess the relationship between two numeric variables.
- Example data: Restaurant2.mtw; examine whether “Popular Index” changes with “Cost.”
- Scatter plot steps in Minitab:
- Graph → Scatter Plot → Simple → OK.
- Optional: Graph → Scatter Plot → With Regression → OK to add a regression line.
- Interpretation questions:
- Do the two variables move together (positive relationship) or in opposite directions (negative relationship)?
- Is the relationship strong, moderate, or weak (based on pattern and, if using regression, the R-squared concept)?
- Practical implications:
- Scatter plots reveal trends, clusters, or potential nonlinear patterns that may warrant further modeling.
- Notes:
- Regression line (if added) summarizes linear association; correlation coefficient (r) quantifies strength/direction (not shown in transcript but commonly used).
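The direction and strength questions above can be put in numbers with the correlation coefficient r and a least-squares line. The sketch below is an illustration under assumptions: the `cost` and `popularity` values are invented (they are not the contents of Restaurant2.mtw), and the function names are hypothetical.

```python
import math

def _sums(xs, ys):
    """Return mean_x, mean_y and the sums of squares and cross-products."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return mean_x, mean_y, sxx, syy, sxy

def pearson_r(xs, ys):
    """Correlation coefficient: sign gives direction, magnitude gives strength."""
    _, _, sxx, syy, sxy = _sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)

def least_squares_line(xs, ys):
    """Return (slope, intercept) of the fitted line y = intercept + slope * x."""
    mean_x, mean_y, sxx, _, sxy = _sums(xs, ys)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Invented values for illustration only.
cost = [15, 20, 25, 30, 35, 40, 45, 50]
popularity = [30, 38, 41, 52, 55, 61, 70, 74]

r = pearson_r(cost, popularity)
slope, intercept = least_squares_line(cost, popularity)
print(f"r = {r:.3f}, R-squared = {r * r:.3f}")
print(f"fitted line: popularity = {intercept:.2f} + {slope:.2f} * cost")
```

For a simple linear regression with one predictor, R-squared equals r squared, so the two summaries carry the same strength information. Neither captures curved patterns, which is why the plot should be inspected first.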
L03.5 Visualizing Categorical Data: One Categorical Variable
- Visualizing a single categorical variable involves understanding relative frequencies.
- Relative frequency table:
- Stat → Tables → Tally Individual Variables…
- Bar charts:
- Graph options: Bar Chart → Simple (or Other types such as Cluster, Stack).
- Bars represent counts of unique values (frequencies) for each category.
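- Variables listed in the transcript example: Restaurant, Food, Decor, Service, Cost, Popularity Index, Cuisine, Region, Z Cost, etc.; of these, categorical variables such as Cuisine and Region are the ones suited to a bar chart.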
- Bar Chart: Simple
- Used to display counts of each category for a single categorical variable.
- Data representation: height of each bar corresponds to the frequency or count.
- Key features:
- Helpful for comparing category frequencies at a glance.
- Can switch to alternative layouts (Cluster, Stack) to compare multiple variables or groupings within the same plot.
- Textual notes from example:
- Variable labels included: Cuisine, Cost, Region, etc.
- Example data shows counts like 0, 3, 8, etc. for category frequencies.
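A tally with relative frequencies, like the one behind a simple bar chart, can be sketched as follows. The `cuisines` list is made up for illustration (the category names come from the notes, the counts do not), and `relative_frequency_table` is a hypothetical helper.

```python
from collections import Counter

def relative_frequency_table(categories):
    """Map each category to (count, proportion), most frequent first."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {cat: (k, k / total) for cat, k in counts.most_common()}

# Made-up observations of one categorical variable.
cuisines = ["Italian", "Chinese", "Italian", "French",
            "Chinese", "Italian", "Mexican", "Italian"]

table = relative_frequency_table(cuisines)
for cat, (count, share) in table.items():
    # Bar length is the count, matching a simple bar chart of frequencies.
    print(f"{cat:<8} {count:>2} {share:6.1%} {'#' * count}")
```

Counts drive bar heights, while the proportions are what a pie chart would display as slice sizes; the proportions sum to 1 by construction.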
L03.5 Visualizing Categorical Data: Pie Chart
- Pie chart aims to show proportional shares of a categorical variable.
- Pie chart setup in Minitab:
- Graph → Pie Chart → Pie Chart.
- Data source options: "Chart counts of unique values" (raw data column) or "Chart values from a table" (pre-summarized counts).
- Categorical variable highlighted: Cuisine.
- Pie Options…, Labels…, Data Options… to adjust appearance and labeling.
- Example from transcript:
- Categories listed under Cuisine: American (New), Chinese, French, Indian, Italian, Japanese, Mexican, etc.
- The chart displays the proportion of each cuisine category among restaurants.
- Considerations:
- Pie charts are most effective when there are a small number of categories with meaningful shares; hard to compare many slivers.
- Use relative frequencies for intuitive understanding of shares.
L03.6 A Contingency Table: Visualizing Two Categorical Variables
Purpose: organize and summarize the relationship between two categorical variables.
Contingency table concept:
- Displays counts for combinations of two categorical factors (e.g., Cuisine × Region).
- Enables analysis of dependence or association between the variables.
Chi-square based approach (from transcript):
- Stat → Tables → Cross Tabulation and Chi-Square.
Key formula (general context):
- Expected counts: E_{ij} = \frac{R_i C_j}{N}, where R_i is the row total for row i, C_j is the column total for column j, and N is the grand total.
- Chi-square statistic: \chi^2 = \sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, where O_{ij} are the observed counts.
Practical notes:
- Cross-tabulations help identify patterns such as whether certain cuisines are more common in particular regions.
- Chi-square test assesses whether there is a statistically significant association between the two categorical variables (not detailed in transcript, but implied by the method name).
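The expected-count and chi-square formulas can be worked through on a small table. This is a sketch with an invented 2 × 2 table of counts; the function names are hypothetical, and the p-value step is left out because it needs the chi-square distribution, which the Python standard library does not provide.

```python
def expected_counts(observed):
    """E_ij = (row total i) * (column total j) / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square(observed):
    """Sum of (O_ij - E_ij)^2 / E_ij over all cells."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

def degrees_of_freedom(observed):
    """(rows - 1) * (columns - 1)."""
    return (len(observed) - 1) * (len(observed[0]) - 1)

# Invented counts: rows are two cuisines, columns are two regions.
observed = [[10, 20],
            [30, 40]]

expected = expected_counts(observed)
print("Expected counts:", expected)
print("Chi-square:", round(chi_square(observed), 4))
print("Degrees of freedom:", degrees_of_freedom(observed))
print("Smallest expected count:", min(min(row) for row in expected))
```

Here the row totals are 30 and 70, the column totals are 40 and 60, and N = 100, so the expected counts are 12, 18, 28, and 42. The smallest expected count is printed because a common rule of thumb asks for expected counts of at least 5 before relying on the chi-square approximation.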
Connections to foundational principles:
- Descriptive statistics summarize data; visualizations (histograms, scatter plots, bar charts, pie charts, contingency tables) convey distribution, relationships, and composition.
- Basic data visualization practices rely on understanding data types (numeric vs categorical), distribution shape, and the appropriate summary metrics.
Practical tips and cautions:
- When interpreting skewness/kurtosis, consider sample size and the presence of outliers which can distort these measures.
- For histograms, choose bin widths thoughtfully to avoid masking or exaggerating features.
- In bar charts and pie charts, label categories clearly and check that counts sum to the sample size (or shares to 100%) to avoid misinterpretation.
- In contingency tables, ensure adequate expected counts for chi-square validity; consider alternative tests if counts are low.
Summary of key formulas and commands
- Boxplot outlier fences:
- IQR = Q3 − Q1
- Upper fence = Q3 + 1.5 · IQR
- Lower fence = Q1 − 1.5 · IQR
- Z-scores:
- Z = \frac{X - \mu}{\sigma} or Z = \frac{X - \bar{X}}{s}
- Outliers: |Z| > 3
- Skewness interpretation:
- Symmetric: skewness ≈ 0
- Left skewed: skewness < 0
- Right skewed: skewness > 0
- Kurtosis interpretation (excess):
- Normal: kurtosis ≈ 0
- Peakier: kurtosis > 0
- Flatter: kurtosis < 0
- Histograms: no fixed equation; visualize distribution shape.
- Scatter plots: assess pattern and strength of linear association; regression line optional.
- Contingency table and chi-square:
- Expected counts: E_{ij} = \frac{R_i C_j}{N}
- Chi-square: \chi^2 = \sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}}