Notes on Analysing Categorical Data

Key Objectives for Topic 1

  • Understand different data types: numerical and categorical.
  • Discover pivot tables’ functionality.
  • Gain insight into data visualization methods.
  • Comprehend basic concepts of probability and probability distributions.
  • Learn to use probability to identify relationships between variables.
  • Get an introduction to the binomial and multinomial distributions (for advanced learners).

1.1 Types of Data

  • Numerical (Quantitative) Data:
    • Takes numeric values.
    • Examples:
      • Number of people in a household.
      • Rate of unemployment.
  • Categorical (Qualitative) Data:
    • Classified into distinct categories.
    • Examples:
      • Industry type: manufacturing, construction.
      • Gender: male, female.
      • Educational level: high school, bachelor.

1.2 Visualising Data in Categories

  • Example: Data on 5,000 Australians regarding medical conditions and exercise habits.
  • Key Variables:
    • Medical Conditions: Asthma, Cancer, Depression, Diabetes, Heart Disease, None of the Above.
    • Exercise Habits: Moderate, Minimal.
Organizing Categorical Data
  • Frequency Distributions:
    • A frequency distribution counts the observations in each category; create one with a pivot table in Excel.
    • Example: 202 of the 5,000 respondents had been diagnosed with Diabetes.
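The counting step behind a frequency distribution can also be sketched outside Excel. The snippet below is a minimal illustration in Python using only the standard library; the ten responses are invented for the example and are not the survey data.

```python
from collections import Counter

# Hypothetical mini-sample: one medical condition per respondent.
# These records are invented for illustration; they are NOT the survey data.
responses = [
    "None of the Above", "Asthma", "None of the Above", "Diabetes", "Depression",
    "None of the Above", "Asthma", "Heart Disease", "None of the Above", "Cancer",
]

# Count how many respondents fall into each category.
frequency = Counter(responses)

# Print the frequency distribution, most frequent category first.
for condition, count in frequency.most_common():
    print(f"{condition:<18} {count}")
```

A pivot table with the condition as the row field and a count as the value produces the same kind of summary.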
Data Presentation Methods
  • Bar Charts:
    • The most common method for presenting categorical data.
    • Advantage: easy to read and interpret.
    • Caution: do not join the categories with lines, because the categories have no natural order.
    • Variations:
      • Order of Categories: e.g., alphabetical, or from most to least frequent (Pareto chart).
      • Using Percentages: present the distribution as percentages so that shares are easier to compare.
      • Omitting Certain Categories: a dominant category (e.g., ‘None of the Above’) can dwarf the others and obscure the differences among them.
  • Pie Charts:
    • Visually appealing, but harder to read accurately than bar charts.
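The ordering, percentage, and omission variations described above can be sketched in a few lines of Python. This is only an illustration: the counts are hypothetical, the helper name to_percentages is made up for the example, and rebasing the percentages on the remaining categories after an omission is one possible choice, not something the notes prescribe.

```python
def to_percentages(counts, omit=()):
    """Return (category, percent) pairs sorted from most to least frequent.

    Categories listed in `omit` are dropped first, and the percentages are
    then computed relative to the remaining total (an assumption of this sketch).
    """
    kept = {k: v for k, v in counts.items() if k not in omit}
    total = sum(kept.values())
    ranked = sorted(kept.items(), key=lambda item: item[1], reverse=True)
    return [(category, 100 * count / total) for category, count in ranked]


# Hypothetical counts, invented for illustration (NOT the survey data).
counts = {
    "Asthma": 30, "Cancer": 10, "Depression": 25,
    "Diabetes": 15, "Heart Disease": 20, "None of the Above": 400,
}

# Text "bar chart" in Pareto order, with and without the dominant category.
for omit in ((), ("None of the Above",)):
    print(f"Omitted: {list(omit)}")
    for category, pct in to_percentages(counts, omit=omit):
        print(f"  {category:<18} {'#' * round(pct / 2):<40} {pct:.1f}%")
```

With the dominant category included, the remaining bars are too short to compare; once it is omitted, the differences among the other categories become visible.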

1.3 Finding Patterns Across Different Characteristics

  • Univariate vs. Bivariate Data:
    • Univariate analysis looks at one characteristic; bivariate analysis looks at two together.
  • Contingency Tables (Cross Tabulation):
    • Summarize the joint distribution of two variables (e.g., medical conditions and exercise habits).
    • Help identify potential relationships (e.g., higher diabetes rates among minimal exercisers).
Creating and Interpreting Bivariate Tables
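  • Displaying Data:
    • Use pivot tables to analyze relationships between categories.
    • Present the counts as percentages so that groups of different sizes can be compared:
      • % of Row: each cell as a share of its row total (e.g., of those with a given medical condition, the share at each exercise level).
      • % of Column: each cell as a share of its column total (e.g., of those at a given exercise level, the share with each condition), which accounts for differences in group sizes.
  • Relevant Findings:
    • Minimal exercise is associated with a higher likelihood of diabetes.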
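To make the difference between % of Row and % of Column concrete, here is a minimal Python sketch of a two-way table. It assumes medical condition is the row variable and exercise level is the column variable; the records and the helper names pct_of_row and pct_of_column are hypothetical.

```python
from collections import Counter

# Hypothetical (condition, exercise) records, invented for illustration.
records = (
    [("Diabetes", "Minimal")] * 6
    + [("Diabetes", "Moderate")] * 2
    + [("No Diabetes", "Minimal")] * 54
    + [("No Diabetes", "Moderate")] * 38
)

# Cell counts plus row (condition) and column (exercise) totals.
cells = Counter(records)
row_totals = Counter(condition for condition, _ in records)
col_totals = Counter(exercise for _, exercise in records)


def pct_of_row(condition, exercise):
    """Cell count as a percentage of its row (condition) total."""
    return 100 * cells[(condition, exercise)] / row_totals[condition]


def pct_of_column(condition, exercise):
    """Cell count as a percentage of its column (exercise) total."""
    return 100 * cells[(condition, exercise)] / col_totals[exercise]


for exercise in ("Minimal", "Moderate"):
    print(
        f"Diabetes / {exercise}: "
        f"{pct_of_row('Diabetes', exercise):.1f}% of row, "
        f"{pct_of_column('Diabetes', exercise):.1f}% of column"
    )
```

In this invented sample the % of Column figures (10% of minimal exercisers versus 5% of moderate exercisers have diabetes) compare the two exercise groups directly, even though the groups differ in size.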

1.4 Probability Distributions for Categorical Data

  • Descriptive analysis is supplemented by an understanding of probability.
  • Three formats for presenting bivariate tables:
    1. % of Total: each cell as a proportion of all observations; these proportions can be read as joint probabilities.
    2. % of Column: conditions on the “cause” variable, to analyze its impact.
    3. % of Row: conditions on the “effect” variable, to analyze outcomes.
Key Probability Concepts
  • Marginal Probabilities:
    • Focus on one characteristic at a time.
    • Example: Pr(Diabetes) = 202/5,000 = 0.0404
  • Joint Probabilities:
    • Probability that two events occur together.
    • Example: Pr(Diabetes ∩ Minimal Exercise) = 0.0292
  • Conditional Probability:
    • Probability of an event given a specific condition or pre-selected category.
    • Example: Pr(Diabetes | Minimal Exercise) = Pr(Diabetes ∩ Minimal Exercise) / Pr(Minimal Exercise), i.e., the joint probability divided by the marginal probability of the condition.
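The three kinds of probability can be computed from counts as in the following Python sketch. The total of 5,000 respondents and the 202 diabetes cases come from the notes, and the joint count of 146 is implied by 0.0292 × 5,000. The number of minimal exercisers is not stated in the notes, so the value used here is a placeholder assumption chosen only to make the arithmetic runnable.

```python
N = 5000                      # respondents (from the notes)
n_diabetes = 202              # from the notes: Pr(Diabetes) = 0.0404
n_diabetes_and_minimal = 146  # implied by Pr(Diabetes ∩ Minimal Exercise) = 0.0292
n_minimal = 2500              # HYPOTHETICAL placeholder: not given in the notes

# Marginal probability: one characteristic at a time.
p_diabetes = n_diabetes / N
p_minimal = n_minimal / N

# Joint probability: both events occur together.
p_joint = n_diabetes_and_minimal / N

# Conditional probability: joint divided by the marginal of the condition.
p_diabetes_given_minimal = p_joint / p_minimal

print(f"Pr(Diabetes)                    = {p_diabetes:.4f}")
print(f"Pr(Diabetes ∩ Minimal Exercise) = {p_joint:.4f}")
print(f"Pr(Diabetes | Minimal Exercise) = {p_diabetes_given_minimal:.4f}  (depends on the placeholder)")
```

Only the first two printed values reflect the data in the notes; the conditional probability changes with whatever count of minimal exercisers is substituted for the placeholder.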

1.5 Are Two Events Independent?

Concept of Independence
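  • Two events A and B are independent if the occurrence of one does not affect the likelihood of the other: Pr(A | B) = Pr(A), or equivalently Pr(A ∩ B) = Pr(A) × Pr(B).
  • Example: compare the empirical Pr(Diabetes | Minimal Exercise) with Pr(Diabetes), using the diabetes and exercise probabilities from the earlier sections.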
Example Evaluation for Independence
  • Job search program: success versus participation:
    • Raw employment data show stark differences between participants and non-participants.
    • Independence is tested by comparing conditional probabilities:
      • The probability of employment differs clearly with participation, which indicates dependence.
  • Program Evaluation Application:
    • Dependence analysis is at the heart of evaluating a program’s effect on targeted outcomes (employment rates, health outcomes).
    • Statistical tests are needed to confirm whether the probabilities are genuinely similar or different.
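A descriptive version of this independence check can be sketched in Python as below. The function name compare_probabilities, the counts, and the tolerance are all assumptions made for illustration; the tolerance is an arbitrary threshold and does not replace a formal statistical test.

```python
def compare_probabilities(n_both, n_a, n_b, n_total, tol=0.01):
    """Compare Pr(A | B) with Pr(A) using raw counts.

    Returns (looks_independent, Pr(A), Pr(A | B)). The events are treated as
    (descriptively) independent when the two probabilities differ by at most
    `tol`, an arbitrary threshold chosen for this sketch.
    """
    p_a = n_a / n_total
    p_a_given_b = n_both / n_b
    return abs(p_a_given_b - p_a) <= tol, p_a, p_a_given_b


# Hypothetical job search program data, invented for illustration:
# 200 job seekers, 100 participants, 110 employed overall, 70 employed participants.
independent, p_employed, p_employed_given_program = compare_probabilities(
    n_both=70, n_a=110, n_b=100, n_total=200
)

print(f"Pr(Employed)               = {p_employed:.2f}")
print(f"Pr(Employed | Participant) = {p_employed_given_program:.2f}")
print("Looks independent" if independent else "Looks dependent")
```

In this invented example the conditional probability (0.70) is well above the marginal probability (0.55), so employment and participation look dependent; whether such a gap could arise by chance is what a statistical test would address.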
Conclusion
  • These notes emphasize categorical data analysis and visualization, and the critical role of probability in establishing relationships between variables and in identifying dependence or independence in data sets.