How to Summarize and Represent Data

1.4 Numerical Data

This section discusses how to summarize and explore numerical variables, focusing on center, spread, and shape, along with the plots used to display them:

Measures of Center

Mean (x̄): The average value, calculated by summing all observations and dividing by the number of observations. Sensitive to outliers.
Median: The middle value when the data are sorted (the average of the two middle values when the number of observations is even). Robust to extreme values.

Tip: Use the median when the distribution is skewed or has outliers.
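As a minimal sketch of why the tip above holds, the following Python snippet compares the mean and median before and after one extreme value is introduced. The data values are hypothetical and chosen only for illustration.

```python
from statistics import mean, median

# Hypothetical data: five values, then the same list with the largest
# value replaced by an extreme outlier.
values = [2, 3, 4, 5, 6]
with_outlier = [2, 3, 4, 5, 60]

mean_before = mean(values)            # sum 20 over 5 observations
median_before = median(values)        # middle of the sorted values

mean_after = mean(with_outlier)       # sum 74 over 5 observations
median_after = median(with_outlier)   # middle value is unchanged

print(f"mean:   {mean_before} -> {mean_after}")
print(f"median: {median_before} -> {median_after}")
```

The single outlier pulls the mean from 4 to 14.8, while the median stays at 4.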

Measures of Spread

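Standard Deviation (s): Roughly the typical distance of observations from the mean; computed as the square root of the variance.
Variance (s²): The average squared deviation from the mean (dividing by n − 1 for a sample); equal to the square of the standard deviation.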
Interquartile Range (IQR): The range of the middle 50% of the data (Q3 − Q1).
Range: Max − Min; simple to compute but highly sensitive to outliers, since it depends only on the two most extreme values.
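The measures of spread above can be sketched in Python with the standard `statistics` module. The data are hypothetical. Quartile conventions differ between textbooks and software, so the IQR computed here (using the default method of `statistics.quantiles`) may not match a hand calculation exactly.

```python
from statistics import quantiles, stdev, variance

# Hypothetical sample of eight observations.
data = [2, 4, 4, 4, 5, 5, 7, 9]

var = variance(data)   # sample variance (divides by n - 1)
s = stdev(data)        # sample standard deviation = sqrt(variance)

# Quartiles: quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1          # spread of the middle 50% of the data

data_range = max(data) - min(data)

print(f"variance = {var:.3f}, sd = {s:.3f}")
print(f"IQR = {iqr}, range = {data_range}")
```

Both `variance` and `stdev` here are the sample versions; `pvariance` and `pstdev` in the same module divide by n instead.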

Shape of Distributions

Symmetric: Mean ≈ Median.
Right-skewed: Long tail to the right; typically Mean > Median.
Left-skewed: Long tail to the left; typically Mean < Median.
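The comparison of mean and median can be turned into a rough rule of thumb, sketched below. The function name `skew_hint` and the 5% tolerance are illustrative assumptions, not a standard test; a histogram remains the more reliable way to judge shape.

```python
from statistics import mean, median

def skew_hint(data, tolerance=0.05):
    """Rough shape label from comparing mean and median (a heuristic only)."""
    m, med = mean(data), median(data)
    spread = max(data) - min(data)
    # Treat the two as "about equal" if they differ by at most a small
    # fraction (assumed: 5%) of the data's range.
    if spread == 0 or abs(m - med) <= tolerance * spread:
        return "roughly symmetric"
    return "right-skewed" if m > med else "left-skewed"

# Hypothetical data sets.
right = [1, 2, 2, 3, 3, 3, 4, 20]   # one large value stretches the right tail
left = [-x for x in right]          # mirror image: tail on the left
symmetric = [1, 2, 3, 4, 5]

for name, sample in [("right", right), ("left", left), ("symmetric", symmetric)]:
    print(name, "->", skew_hint(sample))
```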

Visual Tools

Histograms: Useful for showing shape, center, and spread.
Boxplots: Highlight median, IQR, and potential outliers.
Dotplots: Useful for small datasets.
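To make concrete what a boxplot displays, here is a sketch that computes the underlying numbers. It assumes the common 1.5 × IQR rule for flagging potential outliers, which the text above does not specify; the function name `boxplot_summary` and the data are hypothetical.

```python
from statistics import quantiles

def boxplot_summary(data):
    """Return the values a boxplot draws, flagging points beyond 1.5 * IQR."""
    ordered = sorted(data)
    q1, q2, q3 = quantiles(ordered, n=4)   # Q2 is the median
    iqr = q3 - q1
    # Assumed convention: points beyond these fences are potential outliers.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [x for x in ordered if x < low_fence or x > high_fence]
    return {
        "min": ordered[0],
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": ordered[-1],
        "outliers": outliers,
    }

# Hypothetical data with one unusually large observation.
summary = boxplot_summary([2, 4, 4, 4, 5, 5, 7, 30])
print(summary)
```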

1.5 Categorical Data

This brief section explains how to summarize categorical variables, whose values are distinct categories rather than numbers.

Key Concepts

Frequency Tables: Show the count for each category.
Bar Plots: Visual representation of frequencies or proportions.
Proportions (p̂): The number in a category divided by the total.
Relative Frequency Table: Shows proportions instead of counts.
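A frequency table and its relative-frequency counterpart can be sketched with `collections.Counter`. The survey responses below are hypothetical.

```python
from collections import Counter

# Hypothetical survey responses (a categorical variable).
responses = ["yes", "no", "yes", "unsure", "yes", "no"]

# Frequency table: count of observations in each category.
counts = Counter(responses)

# Relative frequency table: each count divided by the total.
total = sum(counts.values())
proportions = {category: n / total for category, n in counts.items()}

for category, n in counts.most_common():
    print(f"{category:<7} count={n}  proportion={proportions[category]:.3f}")
```

The proportions always sum to 1, which is a quick check on a hand-built table.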

When to Use

These methods are best when dealing with nominal or ordinal variables, such as treatment groups, genotypes, or survey responses.

1.6 Relationships Between Two Variables

This section introduces ways to describe and visualize the relationship between two variables, depending on their types:

Numerical vs. Numerical

Scatterplots: Graphs with points representing (x, y) pairs.
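Correlation (r): Measures the strength and direction of a linear relationship between two quantitative variables. Ranges from −1 to +1; values near ±1 indicate a strong linear relationship, and values near 0 indicate a weak or absent one.
Cautions: Correlation does not imply causation; r describes only linear association, so a strong curved relationship can still have r near 0; outliers can distort correlation.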
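The sketch below computes the Pearson correlation coefficient directly from its definition and shows how a single outlier can distort it. The helper name `pearson_r` and the data are hypothetical.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation: sum of co-deviations over the product of spreads."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)

# Hypothetical data: a perfect increasing line ...
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
r_clean = pearson_r(xs, ys)

# ... and the same data with one extreme point added.
r_outlier = pearson_r(xs + [6], ys + [-20])

print(f"r without outlier: {r_clean:.3f}")
print(f"r with outlier:    {r_outlier:.3f}")
```

In this example one added point is enough to turn a perfect positive correlation into a negative one, which is why a scatterplot should accompany any reported r.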

Categorical vs. Categorical

Contingency Tables: Show frequency counts for combinations of two categorical variables.
Segmented Bar Plots / Mosaic Plots: Visual tools for comparing proportions.
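A contingency table can be sketched as counts of (row, column) pairs, and the row proportions are what a segmented bar plot would display. The group and outcome labels below are hypothetical.

```python
from collections import Counter

# Hypothetical paired observations: (group, outcome) for each subject.
pairs = [
    ("treatment", "improved"), ("treatment", "improved"),
    ("treatment", "improved"), ("treatment", "not improved"),
    ("control", "improved"), ("control", "not improved"),
    ("control", "not improved"), ("control", "not improved"),
]

# Contingency table: count for each combination of the two variables.
table = Counter(pairs)

# Row totals, then the proportion of each outcome within its group.
row_totals = Counter(group for group, _ in pairs)
row_props = {
    (group, outcome): n / row_totals[group]
    for (group, outcome), n in table.items()
}

for (group, outcome), n in sorted(table.items()):
    print(f"{group:<10} {outcome:<13} count={n}  row proportion={row_props[(group, outcome)]:.2f}")
```

Comparing row proportions rather than raw counts keeps the comparison fair when the groups differ in size.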

Categorical vs. Numerical

Side-by-side Boxplots: Compare distributions of a numerical variable across categories.
Side-by-side Dotplots: Show every observation within each category; useful for smaller datasets.
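The numerical comparison behind side-by-side boxplots can be sketched by grouping the numerical values by category and summarizing each group. The category labels and measurements below are hypothetical.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records: (category, numerical measurement).
records = [
    ("A", 5.1), ("A", 4.9), ("A", 5.4),
    ("B", 6.8), ("B", 7.2), ("B", 6.5), ("B", 7.0),
]

# Collect the numerical values separately for each category.
groups = defaultdict(list)
for label, value in records:
    groups[label].append(value)

# One summary per category, as each box in a side-by-side boxplot would show.
medians = {label: median(values) for label, values in groups.items()}

for label in sorted(groups):
    values = groups[label]
    print(f"{label}: n={len(values)}, median={medians[label]}, "
          f"min={min(values)}, max={max(values)}")
```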

Key Idea: Identifying relationships helps with hypothesis generation and study design, and supports causal inference only when the study design allows it (for example, a randomized experiment).