Comprehensive Study Notes on Contingency Tables, Frequency Distributions, Time Series, and Box Plots

Contingency tables, frequencies, and graphical displays

  • General structure of contingency tables

    • The running example cross-classifies physician specialty against the type of surgery recommended for early breast cancer; specialty is organized into two major groups with two subcategories, giving five categories in total.
    • Rows and columns represent the two variables in the contingency table: one variable defines the rows, the other defines the columns.
    • The frequencies shown in contingency tables are the raw counts of observations (the numbers being counted).
    • Marginal totals show the totals for each category across rows or columns (the totals for a given variable).
    • Example concept: in a study of physician specialties, the marginal totals show how many physicians fall into each specialty category, regardless of the surgery type they recommend.
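    The structure described above can be sketched in a few lines of Python. This is a minimal illustration only: the specialty labels, surgery labels, and every count below are hypothetical and are not taken from the study mentioned in these notes.

```python
# Hypothetical counts: rows = physician specialty, columns = recommended surgery.
# Labels and numbers are made up for illustration only.
table = {
    "Surgeon":    {"Radical": 30, "Conservative": 20},
    "Oncologist": {"Radical": 10, "Conservative": 40},
}

# Row marginal totals: sum across the columns of each row.
row_totals = {row: sum(cells.values()) for row, cells in table.items()}

# Column marginal totals: sum down the rows of each column.
columns = list(next(iter(table.values())))
col_totals = {col: sum(table[row][col] for row in table) for col in columns}

# Grand total N: every observation is counted exactly once.
grand_total = sum(row_totals.values())

print("Row totals:   ", row_totals)
print("Column totals:", col_totals)
print("Grand total:  ", grand_total)
```

    Both sets of marginal totals add up to the same grand total, which is a quick consistency check on any contingency table.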
  • Relative frequency tables and proportions

    • A relative frequency table expresses the same data as proportions or percentages instead of counts.
    • Proportion, p = \frac{\text{count of category}}{N}, where N is the total number of observations.
    • Percentage, \text{percentage} = p \times 100.
    • Proportions always lie in the interval 0 \le p \le 1; they cannot be negative and cannot exceed 1.
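    • In practice, the proportion is usually computed from a sample rather than the entire population, so it is a sample proportion (commonly written \hat{p}) that estimates the population proportion p.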
    • When displaying relative frequencies graphically, bars (bar charts) and pie charts are the two common formats.
    • Ensure labels and legends clearly indicate which color/section corresponds to which category.
  • Calculating and interpreting proportions with examples

    • Example: from a study, the proportion of surgeons recommending radical surgery for all patients is computed from the relevant counts over the total sample.
    • Another example: determine the proportion of surgeons who recommend conservative surgery for all patients by counting the surgeons in that category and dividing by the total number of surgeons.
    • If you need the percentage, multiply the proportion by 100 to convert to a percent.
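    As a small sketch of the calculation above, the following Python snippet turns counts into proportions and percentages. The category names and counts are hypothetical placeholders, not figures from the study.

```python
# Hypothetical counts of surgeons by recommendation (illustrative only).
counts = {"Radical": 12, "Conservative": 28, "No preference": 10}

n_total = sum(counts.values())  # N, the total number of surgeons

# Proportion p = count / N; percentage = 100 * p.
proportions = {cat: c / n_total for cat, c in counts.items()}
percentages = {cat: 100 * p for cat, p in proportions.items()}

for cat in counts:
    print(f"{cat}: p = {proportions[cat]:.2f} ({percentages[cat]:.0f}%)")
```

    Each proportion lies between 0 and 1, and the proportions over all categories sum to 1.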
  • Data types and how to summarize them

    • Discrete data: data that take on distinct, separate values (often counts).
    • Continuous data: data that can take on an infinite number of values within an interval (often measured quantities).
    • For discrete data, you can present raw counts; for continuous data, you often create a distribution through class intervals (bins).
  • Class intervals, class width, and distribution construction

    • When turning continuous data into a frequency distribution, you create classes (bins) that partition the data.
    • Class width (bin width): the difference between the lower limits of two consecutive classes.
    • A practical method to choose width:
      • Approximate the data range: Range = (maximum observation) − (minimum observation).
      • Decide the number of classes, k, and compute w \approx \frac{\text{Range}}{k}.
      • Because w often comes out non-integer, round up to a whole number for ease of interpretation (e.g., 6.25 → 7).
    • The lower limit of the first class is typically the minimum observation (or slightly below) to avoid missing data.
    • Class limits should be mutually exclusive (no overlaps).
    • After choosing w, determine the actual class boundaries and fill in the frequency for each class.
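    The steps above can be sketched as follows. The data values and the choice k = 8 are assumptions made for illustration, and half-open classes [lower, upper) are one possible way to keep the classes mutually exclusive.

```python
import math

# Hypothetical measurements (illustrative only).
data = [12, 15, 21, 22, 27, 30, 31, 35, 38, 41, 44, 47, 52, 58, 61]
k = 8  # chosen number of classes (an assumption; the notes do not fix k)

data_range = max(data) - min(data)   # 61 - 12 = 49
raw_width = data_range / k           # 6.125
width = math.ceil(raw_width)         # round up to a whole number: 7

# Half-open classes [lower, lower + width) never overlap.
# Caveat: if Range / k is already a whole number, the maximum lands exactly on
# the last upper limit and is excluded; start slightly below the minimum or
# add one more class in that case.
start = min(data)
classes = [(start + i * width, start + (i + 1) * width) for i in range(k)]

# Count the observations falling in each class.
freqs = [sum(1 for x in data if lo <= x < hi) for lo, hi in classes]

for (lo, hi), f in zip(classes, freqs):
    print(f"[{lo}, {hi}): {f}")
```

    Checking that the frequencies add up to the number of observations confirms that no data point was missed or counted twice.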
  • Distribution tables: frequencies, midpoints, and relative frequencies

    • Frequency column: counts in each class.
    • Midpoints: for each class, the midpoint is \text{midpoint} = \frac{\text{lower limit} + \text{upper limit}}{2}.
    • Relative frequency column: \text{relative frequency}_i = \frac{f_i}{N}, where f_i is the frequency of class i and N is the total number of observations.
    • Percent column: \text{percent}_i = 100 \times \text{relative frequency}_i.
    • Cumulative frequency: C_k = \sum_{i=1}^{k} f_i across classes from the lowest upwards.
    • Cumulative relative frequency: C_k^{rel} = \sum_{i=1}^{k} \text{relative frequency}_i.
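    A distribution table with all of these columns can be built as in the sketch below. The class limits and frequencies are hypothetical; the cumulative relative frequency is computed as C_k / N, which equals the running sum of the relative frequencies but avoids accumulating rounding error.

```python
from itertools import accumulate

# Hypothetical class limits and frequencies (illustrative only).
classes = [(10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [4, 10, 5, 1]

n_total = sum(freqs)                                # N = 20
midpoints = [(lo + hi) / 2 for lo, hi in classes]   # (lower + upper) / 2
rel_freqs = [f / n_total for f in freqs]            # f_i / N
percents = [100 * r for r in rel_freqs]             # 100 * relative frequency
cum_freqs = list(accumulate(freqs))                 # running sum of f_i
cum_rel_freqs = [c / n_total for c in cum_freqs]    # C_k / N

print("class    f   mid   rel    %  cum  cum_rel")
rows = zip(classes, freqs, midpoints, rel_freqs, percents, cum_freqs, cum_rel_freqs)
for (lo, hi), f, m, r, p, c, cr in rows:
    print(f"{lo}-{hi}  {f:3d}  {m:4.1f}  {r:.2f}  {p:3.0f}  {c:3d}  {cr:.2f}")
```

    The last cumulative frequency equals N and the last cumulative relative frequency equals 1, which is a useful check on the table.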
  • Histograms and interpretation of distribution shape

    • Histograms are used for continuous data (grouped data).
    • Adjacent bars are drawn without gaps because the class intervals are contiguous.
    • Shape indicators:
      • Unimodal: one peak.
      • Bimodal: two peaks.
      • Multimodal: more than two peaks.
      • Skewness: right-skew (long tail to the right, data pile up on the left) vs left-skew (long tail to the left, data pile up on the right).
    • Use histograms to assess distribution shape, spread, and central tendency visually.
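    The sketch below prints a text histogram and applies two crude heuristics to it: counting strict local peaks and comparing how many classes lie on each side of the tallest one. Both heuristics and the frequencies are assumptions made for illustration; they are not a formal test of modality or skewness, and flat-topped peaks are not counted.

```python
# Hypothetical class frequencies, ordered from the lowest class to the highest.
freqs = [2, 6, 9, 5, 3, 2, 1]

# Text histogram: one '#' per observation in the class.
for i, f in enumerate(freqs):
    print(f"class {i}: {'#' * f}")


def count_peaks(frequencies):
    """Count classes strictly taller than both neighbours (edges compare to 0)."""
    padded = [0] + list(frequencies) + [0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )


peaks = count_peaks(freqs)

# Crude skew hint: more classes to the right of the tallest class suggests
# a longer right tail (right-skew), and vice versa.
peak_index = freqs.index(max(freqs))
classes_left = peak_index
classes_right = len(freqs) - 1 - peak_index

print("peaks:", peaks)
print("classes left/right of peak:", classes_left, classes_right)
```

    Here the single peak and the longer run of classes to its right are consistent with a unimodal, right-skewed shape.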
  • Time series and line plots

    • Line plots are used to visualize time series data, where observations are recorded at regular time intervals.
    • Examples: airport passenger counts over time, stock closing prices over days.
    • Interpretation focuses on:
      • Trend: long-run increase or decrease.
      • Seasonality: regular short-term fluctuations within a fixed period (e.g., yearly seasonal patterns).
      • Exponential or nonlinear patterns: trends that do not follow a straight line.
    • To interpret trends, a longer time horizon helps distinguish genuine trend from short-term fluctuations.
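    One way to separate trend from short-term fluctuation is a simple moving average. This technique is not part of the notes themselves; it is brought in here as an illustrative assumption, and the passenger counts are hypothetical.

```python
# Hypothetical passenger counts (thousands) per period: an upward trend plus
# a pattern that repeats every 3 periods.
series = [100, 120, 90, 110, 130, 100, 120, 140, 110, 130, 150, 120]


def moving_average(values, window):
    """Return the mean of each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# A window equal to the length of the repeating pattern averages it out,
# leaving the long-run trend.
smoothed = moving_average(series, window=3)

for value in smoothed:
    print(f"{value:.1f}")
```

    The raw series moves up and down from one period to the next, while the smoothed values rise steadily, which makes the underlying trend easier to see.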
  • Median, order, and the five-number summary

    • The median is the middle value of a dataset when ordered from smallest to largest.
    • If the dataset has an odd number of observations, the median is the middle value.
    • If the dataset has an even number of observations, the median is the average of the two middle values: \text{median} = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2} + 1}}{2}, where x_i is the i-th smallest observation.
    • The median is robust to outliers (extreme values have little or no effect on it).
    • Five-number summary: minimum, first quartile (Q1), median (Q2), third quartile (Q3), maximum.
    • Q1 is the median of the lower half of the data.
    • Q3 is the median of the upper half of the data.
    • The quartiles split data into quarters; they help describe the distribution beyond the median.
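    A minimal sketch of the five-number summary follows, using the median-of-halves rule described above. For an odd number of observations the middle value is left out of both halves; that is one common convention and an assumption here, since the notes do not say which convention applies. The data are hypothetical.

```python
def median(sorted_values):
    """Median of an already sorted list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    xs = sorted(values)
    n = len(xs)
    lower = xs[: n // 2]          # lower half (middle value excluded if n is odd)
    upper = xs[(n + 1) // 2:]     # upper half (middle value excluded if n is odd)
    return xs[0], median(lower), median(xs), median(upper), xs[-1]


data = [7, 1, 3, 9, 5, 11, 13, 15]  # hypothetical observations
print(five_number_summary(data))
```

    Statistical software often uses interpolation-based quartile rules, so quartiles from a package may differ slightly from the values this sketch produces.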
  • Interquartile range (IQR) and outliers

    • IQR = Q3 − Q1.
    • Outliers are data points that lie far from the main body of the data.
    • A common rule to identify outliers uses fences:
      • Lower fence: \text{Lower fence} = Q1 - 1.5 \times \text{IQR}.
      • Upper fence: \text{Upper fence} = Q3 + 1.5 \times \text{IQR}.
    • Observations outside these fences are typically labeled as outliers.
    • Some texts distinguish "inner" fences (1.5×IQR beyond the quartiles) from "outer" fences (3×IQR beyond the quartiles), with points beyond the outer fences treated as extreme outliers; the 1.5×IQR rule is the standard starting point.
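    The fence rule can be sketched as below. The data are hypothetical, and the quartiles are stated directly (they are the median-of-halves quartiles of this particular list) so the example stays focused on the fences.

```python
def fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


data = [1, 3, 5, 7, 9, 11, 13, 40]  # hypothetical observations
q1, q3 = 4.0, 12.0                  # quartiles of `data` by the median-of-halves rule

lower, upper = fences(q1, q3)       # IQR = 8, so fences are -8 and 24
outliers = [x for x in data if x < lower or x > upper]

print("fences:", lower, upper)
print("outliers:", outliers)
```

    Calling fences(q1, q3, k=3) gives the wider cutoffs sometimes called outer fences.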
  • Box plots (box-and-whisker plots) and symmetry

    • A box plot visually summarizes the five-number summary: min, Q1, median, Q3, max (when outliers are present, the whiskers stop at the fences and the outliers are plotted separately).
    • The box spans from Q1 to Q3 with a line at the median; whiskers extend to the most extreme data points within the fences.
    • Box plots allow quick comparisons of distributions across groups and help identify skewness and outliers.
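    The pieces a box plot draws can be computed as in this sketch. It uses `statistics.quantiles` from the Python standard library, which interpolates between data points, so its quartiles may differ slightly from the median-of-halves rule; the data are hypothetical.

```python
from statistics import quantiles

data = [1, 3, 5, 7, 9, 11, 13, 40]  # hypothetical observations

# Quartiles from the standard library (interpolation-based, an assumption here;
# the notes describe the median-of-halves rule instead).
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations that are still inside the fences.
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Anything outside the fences is drawn as an individual point.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"box from {q1} to {q3}, median line at {q2}")
print(f"whiskers from {whisker_low} to {whisker_high}")
print(f"outliers: {outliers}")
```

    Note that the whiskers end at actual data values, not at the fences themselves.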
  • Practical notes from the ozone example (continuous data visualization)

    • In a continuous dataset (like ozone measurements), the histograms are typically drawn with adjacent, gap-free bars.
    • The example discusses symmetry and the use of a box plot to compare distributions, highlighting that the median is resistant to outliers.
    • Changing the value of an outlier leaves the median unchanged as long as the point stays on the same side of the median; the median shifts only if the point crosses to the other side. The mean, by contrast, changes whenever any value changes and is pulled toward extreme values.
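    The contrast between the mean and the median can be checked directly. The readings below are hypothetical and are not the ozone data from the example.

```python
from statistics import mean, median

readings = [28, 31, 33, 35, 36, 38, 41]   # hypothetical readings
with_outlier = readings[:-1] + [120]      # largest value replaced by an extreme one
crossed = readings[:-1] + [10]            # largest value moved below the median

print("original:    ", round(mean(readings), 2), median(readings))
print("with outlier:", round(mean(with_outlier), 2), median(with_outlier))
print("crossed:     ", round(mean(crossed), 2), median(crossed))
```

    Making the largest value more extreme moves the mean but not the median; the median changes only in the third case, where the modified value ends up on the other side of the middle.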
  • Connections to foundational concepts and real-world relevance

    • Contingency tables and relative frequencies connect to basic probability and inferential statistics (assessing relationships between categorical variables).
    • Class intervals and histograms are fundamental for describing distributions in populations and samples, which underpins hypothesis testing and estimation.
    • Time-series analysis (line plots) is essential for forecasting, resource planning, and understanding market or system dynamics.
    • The five-number summary and box plots provide robust, quick summaries of data distributions useful in quality control, environmental science, and many applied fields.
  • Ethical and practical implications in data presentation

    • Choosing the number of classes and class width can influence the apparent shape of the distribution; arbitrary choices may mislead if not justified.
    • Clear labeling, units, and legends are essential to avoid misinterpretation.
    • Relying solely on the mean or any other single number can be misleading in the presence of outliers; robust statistics such as the median and IQR offer more reliable summaries of skewed data.
  • Summary of key formulas to remember

    • Proportion: p = \frac{\text{count}}{N}
    • Percentage: \% = 100p
    • Class width: w \approx \frac{\text{Range}}{k}, rounded up to a whole number
    • Midpoint: \text{midpoint} = \frac{\text{lower limit} + \text{upper limit}}{2}
    • Cumulative frequency: C_k = \sum_{i=1}^{k} f_i
    • Cumulative relative frequency: C_k^{rel} = \sum_{i=1}^{k} p_i, where p_i = \frac{f_i}{N}
    • Five-number summary: min, Q1, median (Q2), Q3, max
    • Median for even n: \text{median} = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2} + 1}}{2}
    • Interquartile range: \text{IQR} = Q3 - Q1
    • Outlier fences:
      • Lower fence: Q1 - 1.5\times \text{IQR}
      • Upper fence: Q3 + 1.5\times \text{IQR}