Statistics for Two Categorical Variables & Bivariate Quantitative Data

Statistics for Two Categorical Variables: Summary Statistics and Association

Review of Graphical Displays for Two Categorical Variables

Constructing Graphical Displays:
- Side-by-side bar graph: Displays the distribution of one categorical variable for each category of the other categorical variable.
- Segmented bar graph: Stacks the bars on top of each other, where each bar represents a category of one variable and sections within the bar represent proportions of the other variable.
- Mosaic plot: Similar to a segmented bar graph but also varies the width of the bars based on the relative frequency of the groups.
Determining Association using Graphical Displays:
- If the distributions (proportions) within the segmented bar graph are not the same for each group, then there is an association between the two variables.

Calculating Summary Statistics for Two Categorical Variables

Objective: To calculate numerical summaries from two-way tables and use them to determine if an association exists.
Example Context: Age groups and educational attainment levels.
Two-Way Tables: The first step is to calculate all totals:
- Column Totals: Sum of frequencies in each column.
- Row Totals: Sum of frequencies in each row.
- Grand Total (Table Total): Sum of all frequencies in the table.
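
A minimal Python sketch of the totals step. The counts below are hypothetical, invented only to show the arithmetic; they are not the survey's values, and the names `table`, `row_totals`, and `column_totals` are illustrative.

```python
# Hypothetical counts, invented for illustration only (not the survey data).
# Rows are education levels; columns are age groups.
table = {
    "No HS diploma": {"25-34": 10, "35-54": 20, "55+": 30},
    "HS diploma":    {"25-34": 40, "35-54": 50, "55+": 60},
    "Master's+":     {"25-34": 15, "35-54": 25, "55+": 5},
}

# Row totals: sum of the frequencies in each row.
row_totals = {row: sum(cells.values()) for row, cells in table.items()}

# Column totals: sum of the frequencies in each column.
column_totals = {}
for cells in table.values():
    for col, count in cells.items():
        column_totals[col] = column_totals.get(col, 0) + count

# Grand total: sum of all frequencies in the table.
grand_total = sum(row_totals.values())

print(row_totals)
print(column_totals)
print(grand_total)
```

A quick self-check is that the row totals and the column totals both add up to the same grand total.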

Types of Relative Frequencies

Joint Relative Frequency:
- Definition: A cell frequency (a single count in the table) divided by the total for the entire table.
- Question Example: "What percent of the people in the survey are 25 to 34 years old with a master's degree or higher?"
- Calculation: $(\frac{\text{Frequency of (25-34 yrs AND Master's+)}}{\text{Grand Total}})$
- If the count for "25-34 years old with a master's degree or higher" is, for instance, $41$ and the grand total for all people in the survey is $1966,$
  - This would be $(\frac{41}{1966}) \times 100\% \approx 2.1\%$ .
Marginal Relative Frequency:
- Definition: Row totals or column totals in a two-way table divided by the total for the entire table.
- Question Example 1 (Row Total): "What percent of the people in the survey have only a high school diploma?"
  - Calculation: $(\frac{\text{Total with High School Diploma}}{\text{Grand Total}})$
  - If the total with a high school diploma is $767$ and the grand total is $1966,$
    - This calculation is $(\frac{767}{1966}) \times 100\% \approx 39\%$ .
- Question Example 2 (Column Total): "What percent of the people in the survey are 35 to 54 years old?"
  - Calculation: $(\frac{\text{Total for 35-54 Years Old}}{\text{Grand Total}})$
  - If the total for 35-54 years old is $739$ and the grand total is $1966,$
    - This calculation is $(\frac{739}{1966}) \times 100\% \approx 37.6\%$ .
Conditional Relative Frequency:
- Definition: Relative frequencies for a specific part of a two-way table, generally calculated within one row or within one column.
- Question Example 1 (Within a Row): "What percent of those with only a high school diploma are 35 to 54 years old?"
  - The condition "of those with only a high school diploma" means the denominator is the row total for high school diplomas.
  - Calculation: $(\frac{\text{Frequency of (35-54 yrs AND High School Diploma)}}{\text{Total with High School Diploma}})$
  - If the specific observed value for "35-54 years old with a High School Diploma" is $262$ and the row total for "High School Diploma" is $767,$
    - This calculation is $(\frac{262}{767}) \times 100\% \approx 34.2\%$ .
- Question Example 2 (Within a Column): "What percent of 25- to 34-year-olds have no high school diploma?"
  - The condition "of 25- to 34-year-olds" means the denominator is the column total for that age group.
  - Calculation: $(\frac{\text{Frequency of (25-34 yrs AND No High School Diploma)}}{\text{Total 25-34 Years Old}})$
  - These conditional relative frequencies are crucial for constructing segmented bar graphs that show the distribution of one variable within categories of another.
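
The three kinds of relative frequency differ only in which count goes in the numerator and which total goes in the denominator. The sketch below recomputes the worked examples from the counts quoted above; the `percent` helper and the variable names are assumptions made for this illustration.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

# Counts quoted in the examples above.
grand_total = 1966          # everyone in the survey
hs_row_total = 767          # row total: high school diploma only
age_35_54_col_total = 739   # column total: 35 to 54 years old
cell_25_34_masters = 41     # 25-34 AND master's degree or higher
cell_35_54_hs = 262         # 35-54 AND high school diploma only

# Joint: one cell divided by the grand total.
joint = percent(cell_25_34_masters, grand_total)

# Marginal: a row or column total divided by the grand total.
marginal_row = percent(hs_row_total, grand_total)
marginal_col = percent(age_35_54_col_total, grand_total)

# Conditional: one cell divided by the total of its row (or column).
conditional = percent(cell_35_54_hs, hs_row_total)

print(joint, marginal_row, marginal_col, conditional)
```

The only change from joint to conditional is the denominator: the grand total is replaced by the total of the row or column named in the condition.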

Determining Association using Summary Statistics

Method: Calculate the conditional relative frequencies for each group (e.g., the distribution of educational attainment within each age group).
Conclusion: If the distribution of conditional relative frequencies is different for each group, then the two variables are associated. This means knowing a person's age group helps predict their educational attainment level because the educational attainment profile changes across age groups.
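
One way to sketch this comparison in Python is shown below. The counts are hypothetical and invented for illustration, and `conditional_distribution` is an assumed helper name. The check treats any difference between the groups' distributions as an association, which is the descriptive rule stated above.

```python
# Hypothetical counts, invented for illustration only.
# Outer keys are age groups; inner keys are education levels.
counts_by_age = {
    "25-34": {"No HS diploma": 10, "HS diploma": 40, "Master's+": 50},
    "35-54": {"No HS diploma": 20, "HS diploma": 50, "Master's+": 30},
}

def conditional_distribution(counts):
    """Divide each count by the group's total to get proportions."""
    total = sum(counts.values())
    return {level: count / total for level, count in counts.items()}

# Distribution of educational attainment within each age group.
distributions = {
    age: conditional_distribution(counts)
    for age, counts in counts_by_age.items()
}

# Compare every group's distribution with the first group's.
reference = next(iter(distributions.values()))
associated = any(
    abs(dist[level] - reference[level]) > 1e-9
    for dist in distributions.values()
    for level in reference
)

for age, dist in distributions.items():
    print(age, {level: round(p, 2) for level, p in dist.items()})
print("Associated:", associated)
```

If both age groups had the same proportions at every education level, `associated` would be `False`.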

Bivariate Data: Explanatory and Response Variables & Scatterplots

Introduction to Explanatory and Response Variables

Purpose: To understand how variables relate to each other in bivariate data.
Explanatory Variable (X): The variable that is thought to predict, explain, or influence the changes in another variable. It's often plotted on the horizontal (x) axis.
Response Variable (Y): The variable that measures the outcome or response. It's often plotted on the vertical (y) axis.
Example Context: The "income achievement gap" in mathematics education.
- Background: High/middle-income students tend to perform better on math exams than low-income students (on average), a gap consistent over time (NAEP data).
- Important Caveats: This data reflects group averages, not individual performance. It does not imply anything about innate intelligence, as many factors influence test performance.
- Factors: Wealth privilege and fewer educational barriers for higher-income students. Schools often aim to equalize opportunities.
- Income Attendance Gap: Higher-income areas typically have fewer chronically absent students, possibly due to easier transportation access or less need for students to work.
Proposed Solution (Hypothesis): Some school systems believe that improving attendance for low-income students will raise test scores (Poverty $\to$ Low Attendance; Fix Attendance $\to$ High Attendance $\to$ High Test Scores).
- This is a statement to be assessed for reasonableness (avoiding "bad statistics").
Data Example: A random sample of $11$ students.
- Variables: Percent of school days attended, and number of questions answered correctly on an Algebra 1 assessment.
- Identifying Variables:
  - Explanatory (X): Percent of school days attended (as attendance is hypothesized to explain test performance).
  - Response (Y): Number of questions answered correctly (the outcome or response to attendance).

Constructing a Scatterplot

When to Use: A scatterplot is used to visualize the relationship between two quantitative variables.
- Both "percent attendance" and "questions correct" are quantitative data, meaning they are numerical quantities with inherent order.
How to Construct: Each individual in the dataset is represented as a set of coordinates $(x, y)$ .
- The explanatory variable $(x)$ is plotted on the horizontal axis.
- The response variable $(y)$ is plotted on the vertical axis.
Essential Components for a Scatterplot (e.g., for AP Exam):
1. Title: A descriptive title that clearly states the context of the variables being displayed.
2. Labeled Axes: Both the x-axis and y-axis must be labeled with the variable names and appropriate units.
3. Tick Marks and Scale: Clearly show the scale on both axes with tick marks to prevent misleading interpretations of the data.
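
The construction can be sketched with a rough text-based plot in plain Python. The 11 data pairs are hypothetical (the sample's actual values are not listed in these notes), and `text_scatterplot` is an assumed helper name. The sketch reports only the endpoints of each scale; a hand-drawn or software-drawn plot would show evenly spaced tick marks as well.

```python
# Hypothetical (percent attended, questions correct) pairs for 11 students,
# invented for illustration only.
students = [(62, 9), (85, 18), (88, 20), (90, 22), (91, 21), (93, 24),
            (94, 25), (95, 27), (96, 26), (98, 29), (100, 30)]

def text_scatterplot(points, title, x_label, y_label, width=40, height=12):
    """Draw each (x, y) pair as '*' on a character grid.

    Assumes the x-values are not all equal and the y-values are not all equal.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in points:
        # Explanatory variable sets the column; response sets the row.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # grid row 0 is the top of the plot
    lines = [title, f"{y_label} (top = {y_max}, bottom = {y_min})"]
    lines += ["|" + "".join(cells) for cells in grid]
    lines.append("+" + "-" * width)
    lines.append(f"{x_label} (left = {x_min}, right = {x_max})")
    return "\n".join(lines)

plot = text_scatterplot(
    students,
    title="Attendance vs. Algebra 1 Questions Correct",
    x_label="Percent of school days attended (%)",
    y_label="Questions answered correctly",
)
print(plot)
```

The title and both axis labels are passed in explicitly, mirroring the required components listed above.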

Describing a Scatterplot

Purpose: To provide a comprehensive summary of the trends and features observed in the relationship between two quantitative variables.
Key Aspects to Describe (Think "DFSUC" - Direction, Form, Strength, Unusual Features, Context):
1. Direction: Describes the overall trend of the y-values as the x-values increase.
  - Positive Association: As x-values increase, y-values tend to increase (upward trend from left to right).
  - Negative Association: As x-values increase, y-values tend to decrease (downward trend from left to right).
  - No Association: No discernible trend between x and y values.
  - Example from Data: A positive direction, meaning as a student's attendance rate increases, the number of exam questions they answer correctly also tends to increase.
2. Form: Describes the shape of the relationship.
  - Linear Form: Data points tend to follow a straight-line pattern.
  - Nonlinear Form: Data points follow a curved or otherwise non-straight pattern.
  - Example from Data: Appears to be approximately linear.
3. Strength: Describes how closely the data points follow the identified form (e.g., a linear path).
  - Strong Association: Data points lie very close to the pattern; the explanatory variable is a good predictor of the response variable, with little variation around the model.
  - Weak Association: Data points are spread out and loosely follow the pattern; the explanatory variable is a poor predictor, with significant variation around the model.
  - Example from Data: A strong relationship, indicating that attendance rate is a good predictor of exam performance, with data points close to a linear model.
4. Unusual Features: Points that deviate from the general pattern or clusters within the data.
  - Outliers: Individual data points that fall far outside the overall pattern of the relationship.
  - Clusters: Distinct groups of data points.
  - Example from Data: One student had unusually low attendance compared to the rest of the sample.
5. Context: Always frame the description in terms of the specific variables being studied.
Comprehensive Description Example (for Attendance Rate vs. Exam Performance):
- "The relationship between attendance rate and exam performance appears to be positive, linear, and strong. There appears to be one student with unusually low attendance."
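
Direction and strength are described visually in these notes. One common numerical companion for a linear pattern is the correlation coefficient $r$, which is an addition here rather than part of the description steps above. The sketch below computes it by hand for hypothetical data (invented for illustration; the sample's actual values are not listed), with `pearson_r` as an assumed helper name.

```python
from math import sqrt

# Hypothetical (percent attended, questions correct) pairs for 11 students,
# invented for illustration only.
students = [(62, 9), (85, 18), (88, 20), (90, 22), (91, 21), (93, 24),
            (94, 25), (95, 27), (96, 26), (98, 29), (100, 30)]

def pearson_r(points):
    """Correlation coefficient r for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    return sxy / sqrt(sxx * syy)

r = pearson_r(students)

# The sign of r gives the direction of a linear association.
direction = "positive" if r > 0 else "negative" if r < 0 else "none"

print(f"r = {r:.3f}, direction: {direction}")
```

Values of $r$ near $1$ or $-1$ indicate points close to a line, and values near $0$ indicate a weak linear pattern. A single unusual point, such as the low-attendance student, can change $r$ noticeably, so the number should be read alongside the plot.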
Critical Thinking Caution: A strong positive relationship (correlation) does not automatically imply causation. If a school implements a policy to raise attendance, an increase in test scores does not necessarily follow, because the observed association alone cannot show that attendance causes the higher scores.
General Statistical Principles: When analyzing data, it is crucial to be critical, cautious, and compassionate, and to avoid "BS" (Bad Statistics), that is, drawing unfounded or misleading conclusions.