Comprehensive Study Notes on Contingency Tables, Frequency Distributions, Time Series, and Box Plots
Contingency tables, frequencies, and graphical displays
General structure of contingency tables
- There are two big categories (major groups) and two subcategories, totaling five categories for physician specialty and the corresponding surgery types recommended for early breast cancer.
- The rows represent one variable and the columns represent the other.
- The frequencies shown in contingency tables are the raw counts of observations (the numbers being counted).
- Marginal totals show the totals for each category across rows or columns (the totals for a given variable).
- Example concept: in a study of physician specialties, the marginal totals show how many physicians fall into each specialty.
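As a minimal sketch of how marginal totals come out of the raw counts, the snippet below sums a small two-way table by row and by column. The counts and category labels are invented for illustration and are not the study's data.

```python
# Hypothetical counts, invented for illustration only (not the study's data).
# Rows: physician specialty; columns: recommended surgery type.
table = {
    "Surgeon": {"Radical": 30, "Conservative": 20},
    "Oncologist": {"Radical": 10, "Conservative": 40},
}

# Row marginal totals: sum across the columns of each row.
row_totals = {row: sum(cells.values()) for row, cells in table.items()}

# Column marginal totals: sum down the rows of each column.
col_totals = {}
for cells in table.values():
    for col, count in cells.items():
        col_totals[col] = col_totals.get(col, 0) + count

# The grand total is the same whether the rows or the columns are summed.
grand_total = sum(row_totals.values())

print(row_totals)
print(col_totals)
print(grand_total)
```

The row totals and the column totals each add up to the same grand total, which is a quick consistency check on any contingency table.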
Relative frequency tables and proportions
- A relative frequency table expresses the same data as proportions or percentages instead of counts.
- Proportion, p = \frac{\text{count of category}}{N}, where N is the total number of observations.
- Percentage, \text{percentage} = p \times 100.
- Proportions always lie in the interval 0 \le p \le 1; they cannot be negative and cannot exceed 1.
- In practice, p is usually a sample proportion (often written \hat{p}) because we usually have a sample rather than the entire population.
- When displaying relative frequencies graphically, bars (bar charts) and pie charts are the two common formats.
- Ensure labels and legends clearly indicate which color/section corresponds to which category.
Calculating and interpreting proportions with examples
- Example: the proportion of surgeons recommending radical surgery for all patients is the number of surgeons in that category divided by the total number of surgeons in the sample.
- Another example: determine the proportion of surgeons who recommend conservative surgery for all patients by counting the surgeons in that category and dividing by the total number of surgeons.
- If you need the percentage, multiply the proportion by 100 to convert to a percent.
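The count-to-proportion-to-percentage steps above can be sketched as follows. The surgeon counts are hypothetical placeholders, chosen only so the arithmetic is easy to follow.

```python
# Hypothetical counts of surgeons by recommendation (illustration only).
counts = {"Radical": 25, "Conservative": 75}

n = sum(counts.values())  # N, the total number of surgeons in the sample

# Proportion p = count / N for each category.
proportions = {category: count / n for category, count in counts.items()}

# Percentage = p * 100.
percentages = {category: 100 * p for category, p in proportions.items()}

for category in counts:
    print(category, proportions[category], percentages[category])
```

Every proportion lies between 0 and 1, and the proportions across all categories sum to 1.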
Data types and how to summarize them
- Discrete data: data that take on distinct, separate values (often counts).
- Continuous data: data that can take on an infinite number of values within an interval (often measured quantities).
- For discrete data, you can present raw counts; for continuous data, you often create a distribution through class intervals (bins).
Class intervals, class width, and distribution construction
- When turning continuous data into a frequency distribution, you create classes (bins) that partition the data.
- Class width (bin width): the difference between the lower limits of two consecutive classes.
- A practical method to choose width:
- Approximate the data range: Range = (maximum observation) − (minimum observation).
- Decide the number of classes, k, and compute w \approx \frac{\text{Range}}{k}.
- Because w often comes out non-integer, round up to a whole number for ease of interpretation (e.g., 6.25 → 7).
- The lower limit of the first class is typically the minimum observation (or slightly below) to avoid missing data.
- Class limits should be mutually exclusive (no overlaps).
- After choosing w, determine the actual class boundaries and fill in the frequency for each class.
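The width-selection steps above can be sketched in a few lines. The data values and the function names build_classes and count_frequencies are made up for this illustration. One assumption goes beyond the notes: classes are treated as [lower, upper), with the last class also including its upper limit, so the maximum is never dropped when Range / k happens to be a whole number.

```python
import math

def build_classes(data, k):
    """Return (width, list of (lower, upper) limits) for k classes."""
    lowest, highest = min(data), max(data)
    width = math.ceil((highest - lowest) / k)  # round up, e.g. 6.25 -> 7
    width = max(width, 1)  # guard against all values being equal
    limits = [(lowest + i * width, lowest + (i + 1) * width) for i in range(k)]
    return width, limits

def count_frequencies(data, limits):
    """Count observations per class.

    Assumed convention: each class is [lower, upper); the last class
    also includes its upper limit.
    """
    freqs = [0] * len(limits)
    last = len(limits) - 1
    for x in data:
        for i, (lower, upper) in enumerate(limits):
            if lower <= x < upper or (i == last and x == upper):
                freqs[i] += 1
                break
    return freqs

# Hypothetical observations: range = 37 - 12 = 25, so 25 / 4 = 6.25 -> width 7.
data = [12, 15, 18, 21, 22, 25, 27, 30, 31, 37]
width, limits = build_classes(data, k=4)
freqs = count_frequencies(data, limits)

print(width)
print(limits)
print(freqs)
```

Because the classes do not overlap and together cover the full range, the frequencies add up to the number of observations.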
Distribution tables: frequencies, midpoints, and relative frequencies
- Frequency column: counts in each class.
- Midpoints: for each class, the midpoint is \text{midpoint} = \frac{\text{lower limit} + \text{upper limit}}{2}.
- Relative frequency column: \text{relative frequency}_i = \frac{f_i}{N}, where f_i is the frequency of class i and N is the total number of observations.
- Percent column: \text{percent}_i = 100 \times \text{relative frequency}_i.
- Cumulative frequency: C_k = \sum_{i=1}^{k} f_i, summing across classes from the lowest upwards.
- Cumulative relative frequency: C_k^{rel} = \sum_{i=1}^{k} \text{relative frequency}_i.
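To see how the columns of a distribution table relate to one another, here is a small sketch that builds each column from a set of class limits and frequencies. The limits and frequencies are hypothetical. The cumulative relative frequency is computed as C_k / N, which equals the running sum of relative frequencies but avoids accumulating rounding error.

```python
from itertools import accumulate

# Hypothetical class limits and frequencies (illustration only).
limits = [(12, 19), (19, 26), (26, 33), (33, 40)]
freqs = [3, 3, 3, 1]

n = sum(freqs)  # N, the total number of observations

midpoints = [(lower + upper) / 2 for lower, upper in limits]
rel_freqs = [f / n for f in freqs]           # f_i / N
percents = [100 * r for r in rel_freqs]      # 100 * relative frequency
cum_freqs = list(accumulate(freqs))          # C_k = f_1 + ... + f_k
cum_rel_freqs = [c / n for c in cum_freqs]   # C_k / N

print("class     mid    f   rel    pct   cum  cum_rel")
for (lower, upper), m, f, r, p, c, cr in zip(
    limits, midpoints, freqs, rel_freqs, percents, cum_freqs, cum_rel_freqs
):
    print(f"{lower:>3}-{upper:<3} {m:6.1f} {f:4d} {r:6.2f} {p:6.1f} {c:5d} {cr:8.2f}")
```

Two checks are worth making on any such table: the last cumulative frequency should equal N, and the last cumulative relative frequency should equal 1.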
Histograms and interpretation of distribution shape
- Histograms are used for continuous data (grouped data).
- The absence of gaps between adjacent bars reflects that the class intervals are contiguous.
- Shape indicators:
- Unimodal: one peak.
- Bimodal: two peaks.
- Multimodal: more than two peaks.
- Skewness: right-skew (long tail to the right, data pile up on the left) vs left-skew (long tail to the left, data pile up on the right).
- Use histograms to assess distribution shape, spread, and central tendency visually.
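As a rough illustration of the shape vocabulary above, the sketch below counts peaks in a list of class frequencies and prints a text histogram. The peak rule (a bar strictly taller than both neighbours) is a simplifying assumption made for this example, not a formal definition. Real data with noisy bars usually needs a judgment call by eye. The frequencies are hypothetical.

```python
def count_peaks(freqs):
    """Count bars strictly taller than both neighbours (edges compare to 0)."""
    padded = [0] + list(freqs) + [0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )

def describe_modality(freqs):
    """Map the peak count to the labels used in the notes."""
    peaks = count_peaks(freqs)
    if peaks == 1:
        return "unimodal"
    if peaks == 2:
        return "bimodal"
    if peaks > 2:
        return "multimodal"
    return "no clear peak"

# Hypothetical class frequencies: data pile up on the left with a long right tail.
right_skewed = [9, 6, 3, 1, 1]
two_humps = [2, 8, 3, 1, 7, 2]

for f in right_skewed:
    print("#" * f)  # one text bar per class

print(describe_modality(right_skewed))
print(describe_modality(two_humps))
```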
Time series and line plots
- Line plots are used to visualize time series data, where observations are recorded at regular time intervals.
- Examples: airport passenger counts over time, stock closing prices over days.
- Interpretation focuses on:
- Trend: long-run increase or decrease.
- Seasonality: regular short-term fluctuations within a fixed period (e.g., yearly seasonal patterns).
- Exponential or nonlinear patterns: trends that do not follow a straight line.
- To interpret trends, a longer time horizon helps distinguish genuine trend from short-term fluctuations.
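One common way to separate trend from seasonality is a moving average taken over one full seasonal period. The notes do not prescribe this technique; it is brought in here only as an illustration. The quarterly passenger figures are invented, and moving_average is a hypothetical helper name.

```python
def moving_average(series, window):
    """Trailing moving average; returns len(series) - window + 1 values."""
    return [
        sum(series[i : i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Hypothetical quarterly passenger counts (in thousands): an upward trend
# plus a pattern that repeats every 4 quarters.
passengers = [10, 14, 18, 12, 14, 18, 22, 16, 18, 22, 26, 20]

# Averaging over one full period (4 quarters) cancels the seasonal swings.
trend = moving_average(passengers, window=4)

print(trend)
```

In the raw series the value drops every fourth quarter, but the smoothed values rise steadily, which is the long-run trend the seasonal swings were hiding.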
Median, order, and the five-number summary
- The median is the middle value of a dataset when ordered from smallest to largest.
- If the dataset has an odd number of observations, the median is the middle value.
- If the dataset has an even number of observations, the median is the average of the two middle values: \text{median} = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2} + 1}}{2}, where x_j is the j-th smallest value.
- The median is robust to outliers (extreme values have little or no effect on it).
- Five-number summary: minimum, first quartile (Q1), median (Q2), third quartile (Q3), maximum.
- Q1 is the median of the lower half of the data.
- Q3 is the median of the upper half of the data.
- The quartiles split data into quarters; they help describe the distribution beyond the median.
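The definitions above translate fairly directly into code. This is a minimal sketch of the median-of-halves method, with one assumption the notes leave open: when n is odd, the middle value is excluded from both halves. Textbooks and software differ on this convention, so quartiles from other tools may not match exactly. The data are hypothetical, and the functions expect at least two observations.

```python
def median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method.

    Assumed convention: for odd n the median itself is left out of both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1 :] if n % 2 == 1 else ordered[half:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

data = [7, 1, 3, 9, 5, 11, 13, 15]  # hypothetical observations
print(five_number_summary(data))
```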
Interquartile range (IQR) and outliers
- IQR = Q3 − Q1.
- Outliers are data points that lie far from the main body of the data.
- A common rule to identify outliers uses fences:
- Lower fence: \text{Lower fence} = Q1 - 1.5 \times \text{IQR}.
- Upper fence: \text{Upper fence} = Q3 + 1.5 \times \text{IQR}.
- Observations outside these fences are typically labeled as outliers.
- Some texts distinguish "inner" fences (1.5×IQR beyond the quartiles) from "outer" fences (3×IQR beyond the quartiles, marking extreme outliers), but the 1.5×IQR rule is the standard starting point.
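The fence rule can be sketched as below. The quartiles are passed in directly so the example does not depend on any particular quartile convention. The data, quartile values, and function names are hypothetical.

```python
def fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, q1, q3, k=1.5):
    """Return the observations that fall outside the fences."""
    lower, upper = fences(q1, q3, k)
    return [x for x in values if x < lower or x > upper]

# Hypothetical data with Q1 = 4 and Q3 = 12 (median-of-halves method).
data = [1, 3, 5, 7, 9, 11, 13, 40]
q1, q3 = 4.0, 12.0

print(fences(q1, q3))               # IQR = 8, so fences at 4 - 12 and 12 + 12
print(find_outliers(data, q1, q3))  # values beyond the fences
```

Here the IQR is 8, the fences sit at -8 and 24, and only the value 40 falls outside them.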
Box plots (box-and-whisker plots) and symmetry
- A box plot visually summarizes the five-number summary: min, Q1, median, Q3, max (within fences; outliers may be plotted separately).
- The box spans from Q1 to Q3 with a line at the median; whiskers extend to the most extreme data points within the fences.
- Box plots allow quick comparisons of distributions across groups and help identify skewness and outliers.
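A point that is easy to miss is that the whiskers stop at actual data values, not at the fences themselves. The sketch below makes this concrete. The data and fence values are hypothetical, and whisker_ends is a made-up helper name.

```python
def whisker_ends(values, lower_fence, upper_fence):
    """Return the smallest and largest observations that lie within the fences."""
    inside = [x for x in values if lower_fence <= x <= upper_fence]
    return min(inside), max(inside)

# Hypothetical data with fences at -8 and 24 (from Q1 = 4, Q3 = 12, IQR = 8).
data = [1, 3, 5, 7, 9, 11, 13, 40]
low_whisker, high_whisker = whisker_ends(data, lower_fence=-8.0, upper_fence=24.0)

print(low_whisker, high_whisker)
```

The upper whisker ends at 13, the largest observation inside the fence, while 40 would be drawn as a separate outlier point.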
Practical notes from the ozone example (continuous data visualization)
- In a continuous dataset (like ozone measurements), the histograms are typically drawn with adjacent, gap-free bars.
- The example discusses symmetry and the use of a box plot to compare distributions, highlighting that the median is resistant to outliers.
- Changing an outlier's value leaves the median unchanged as long as the value stays on the same side of the median; only a change that moves it across the middle can shift the median. The mean, by contrast, changes whenever any value changes.
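A small numeric check makes the contrast between mean and median concrete. The readings below are invented, ozone-like values and are not taken from the example dataset. The sketch uses the mean and median functions from Python's standard statistics module.

```python
import statistics

# Hypothetical ozone-like readings (illustration only).
readings = [30, 32, 35, 36, 38, 40, 41]

# Case 1: the largest value becomes an extreme outlier (41 -> 140).
# It stays on the same side of the median.
stretched = [30, 32, 35, 36, 38, 40, 140]

# Case 2: the smallest value is replaced by an extreme high value (30 -> 140).
# It crosses from below the median to above it.
crossed = [32, 35, 36, 38, 40, 41, 140]

for label, values in [("original", readings), ("stretched", stretched), ("crossed", crossed)]:
    print(label, statistics.median(values), round(statistics.mean(values), 2))
```

In the first case the median stays at 36 while the mean jumps from 36 to about 50. In the second case the median moves, but only to the neighbouring value 38, which is what resistance to outliers means in practice.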
Connections to foundational concepts and real-world relevance
- Contingency tables and relative frequencies connect to basic probability and inferential statistics (assessing relationships between categorical variables).
- Class intervals and histograms are fundamental for describing distributions in populations and samples, which underpins hypothesis testing and estimation.
- Time-series analysis (line plots) is essential for forecasting, resource planning, and understanding market or system dynamics.
- The five-number summary and box plots provide robust, quick summaries of data distributions useful in quality control, environmental science, and many applied fields.
Ethical and practical implications in data presentation
- Choosing the number of classes and class width can influence the apparent shape of the distribution; arbitrary choices may mislead if not justified.
- Clear labeling, units, and legends are essential to avoid misinterpretation.
- Relying solely on the mean or any other single number can be misleading in the presence of outliers; robust statistics like the median and IQR offer more reliable summaries of skewed data.
Summary of key formulas to remember
- Proportion: p = \frac{\text{count}}{N}
- Percentage: \% = 100p
- Class width: w \approx \frac{\text{Range}}{k}, rounded up to a whole number
- Midpoint: \text{midpoint} = \frac{\text{lower limit} + \text{upper limit}}{2}
- Cumulative frequency: C_k = \sum_{i=1}^{k} f_i
- Cumulative relative frequency: C_k^{rel} = \sum_{i=1}^{k} p_i, where p_i = \frac{f_i}{N}
- Five-number summary: min, Q1, median (Q2), Q3, max
- Median for even n: \text{median} = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2} + 1}}{2}
- Interquartile range: \text{IQR} = Q3 - Q1
- Outlier fences:
- Lower fence: Q1 - 1.5\times \text{IQR}
- Upper fence: Q3 + 1.5\times \text{IQR}