Notes on Analysing Categorical Data
Analysing Categorical Data
Key Objectives for Topic 1
- Understand different data types: numerical and categorical.
- Discover pivot tables’ functionality.
- Gain insight into data visualization methods.
- Comprehend basic concepts of probability and probability distributions.
- Learn to use probability to identify relationships between variables.
- Get an introduction to binomial and multinomial distributions (for advanced learners).
1.1 Types of Data
- Numerical (Quantitative) Data:
- Takes numeric values.
- Examples:
- Number of people in a household.
- Rate of unemployment.
- Categorical (Qualitative) Data:
- Classified into distinct categories.
- Examples:
- Industry type: manufacturing, construction.
- Gender: male, female.
- Educational level: high school, bachelor.
1.2 Visualising Data in Categories
- Example: Data on 5,000 Australians regarding medical conditions and exercise habits.
- Key Variables:
- Medical Conditions:
- Asthma, Cancer, Depression, Diabetes, Heart Disease, None of the Above.
- Exercise Habits:
- Moderate and Minimal.
Organizing Categorical Data
- Frequency Distributions:
- Create using pivot tables in Excel.
- Example: 202 individuals diagnosed with Diabetes among 5,000 respondents.
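Outside Excel, the same kind of frequency distribution can be sketched in a few lines of Python. This is only an illustration, not the method used in these notes: `frequency_table` is a hypothetical helper name and the `responses` list is made-up sample data. Only the figure of 202 Diabetes cases among 5,000 respondents comes from the example above.

```python
from collections import Counter


def frequency_table(values):
    """Return {category: (count, percent of all values)}."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, 100 * n / total) for cat, n in counts.items()}


# Made-up responses, for illustration only (NOT the survey data).
responses = ["Asthma", "Diabetes", "None of the Above", "Asthma",
             "None of the Above", "None of the Above", "Depression",
             "None of the Above"]
table = frequency_table(responses)

# Relative frequency from the notes: 202 Diabetes cases among 5,000 respondents.
diabetes_share = 202 / 5000  # 0.0404, i.e. 4.04%
```

Each entry of `table` pairs a raw count with its percentage, which is the same information a pivot table shows when a category is counted and then displayed as a share of the total.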
Data Presentation Methods
- Bar Charts:
- The most common method for presenting categorical data.
- Advantages:
- Easy to read and interpret.
- Note: the bars should not be joined by a line, because the categories have no natural order.
- Variations:
- Order of Categories: e.g., alphabetical or from most to least frequent (Pareto chart).
- Using Percentages:
- Present distributions as percentages for clearer insights.
- Pie Charts:
- Visually appealing, but harder to read accurately than bar charts.
- Omitting Certain Categories:
- A dominant category (e.g., ‘None of the Above’) can obscure the differences between the other categories, so it is sometimes left out of the chart.
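The ordering, percentage, and omission ideas above can be sketched together in Python. This is a minimal illustration under stated assumptions: `pareto_percentages` is a hypothetical helper, and the counts are invented for the example rather than taken from the survey.

```python
def pareto_percentages(counts, omit=()):
    """Return (category, percent) pairs sorted from most to least frequent.

    Categories listed in `omit` are dropped before percentages are computed.
    """
    kept = {cat: n for cat, n in counts.items() if cat not in omit}
    total = sum(kept.values())
    ordered = sorted(kept.items(), key=lambda item: item[1], reverse=True)
    return [(cat, 100 * n / total) for cat, n in ordered]


# Hypothetical counts for illustration only (NOT the survey results).
counts = {"Asthma": 30, "Cancer": 10, "Depression": 25,
          "Diabetes": 20, "Heart Disease": 15, "None of the Above": 400}

with_all = pareto_percentages(counts)
without_none = pareto_percentages(counts, omit={"None of the Above"})
```

With every category included, ‘None of the Above’ takes 80% of the chart in this made-up data. Once it is omitted, the remaining five categories are rescaled to sum to 100% and are easier to compare.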
1.3 Finding Patterns Across Different Characteristics
- Univariate vs. Bivariate Data:
- Univariate looks at one characteristic; bivariate analyzes two.
- Contingency Tables (Cross Tabulation):
- Useful for summarizing relationships between two variables (e.g., medical conditions and exercise habits).
- Help identify potential relationships (e.g., higher diabetes rates among minimal exercisers).
Creating and Interpreting Bivariate Tables
- Displaying Data:
- Use pivot tables to analyze relations between categories.
- Present data as percentages for better usability:
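- % of Row: Each cell as a share of its row total; with medical conditions in the rows, this shows how people with each condition split across exercise habits.
- % of Column: Each cell as a share of its column total; this accounts for differences in the sizes of the exercise groups.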
- Relevant Findings:
- Lack of exercise correlates with higher diabetes likelihood.
1.4 Probability Distributions for Categorical Data
- Descriptive analysis can be supplemented with an understanding of probability.
- Three formats for presenting bivariate tables:
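- % of Total: Each cell as a share of all observations; these proportions can be read as joint probabilities.
- % of Column: Each cell as a share of its column total; used when the columns hold the “cause” variable, to analyze its impact.
- % of Row: Each cell as a share of its row total; used to focus on the “effect” variable and analyze outcomes.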
Key Probability Concepts
Marginal Probabilities:
- Focus on one characteristic at a time:
- Example: Pr(Diabetes) = 0.0404
Joint Probabilities:
- Probability that two events co-occur:
- Example: Pr(Diabetes ∩ Minimal Exercise) = 0.0292
Conditional Probability:
- Probability given a specific condition or pre-selected category:
- Example: Pr(Diabetes | Minimal Exercise) = Pr(Diabetes ∩ Minimal Exercise) / Pr(Minimal Exercise), i.e., the joint probability divided by the marginal probability of the condition.
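The three kinds of probability can be tied together with a short Python sketch. The marginal and joint values are the ones quoted above. `conditional` is a hypothetical helper, and the value used for Pr(Minimal Exercise) is an assumption, since these notes do not quote it.

```python
def conditional(joint, marginal_of_condition):
    """Pr(A | B) = Pr(A and B) / Pr(B)."""
    if marginal_of_condition <= 0:
        raise ValueError("the conditioning event must have positive probability")
    return joint / marginal_of_condition


pr_diabetes = 202 / 5000          # marginal probability, 0.0404
pr_diabetes_and_minimal = 0.0292  # joint probability, as quoted in the notes

# Conditioning on Diabetes needs only figures quoted in the notes.
pr_minimal_given_diabetes = conditional(pr_diabetes_and_minimal, pr_diabetes)

# Pr(Diabetes | Minimal Exercise) also needs Pr(Minimal Exercise), which the
# notes do not quote; 0.40 is an ASSUMED value used only to show the formula.
pr_minimal_assumed = 0.40
pr_diabetes_given_minimal = conditional(pr_diabetes_and_minimal,
                                        pr_minimal_assumed)
```

From the quoted figures alone, about 72% of respondents with diabetes are minimal exercisers (0.0292 / 0.0404). The second result depends entirely on the assumed marginal and should be recomputed with the real value.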
1.5 Are Two Events Independent?
Concept of Independence
- Two events are independent if the occurrence of one does not affect the likelihood of the other.
- Example: use the empirical probabilities for diabetes and exercise from the earlier tables to check whether the two events are independent.
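One common way to check independence numerically is to compare the joint probability with the product of the two marginals, since independent events satisfy Pr(A ∩ B) = Pr(A) × Pr(B). The sketch below assumes a value for Pr(Minimal Exercise), which these notes do not quote. `looks_independent` and its tolerance are hypothetical choices for illustration, not a formal statistical test.

```python
def looks_independent(pr_a, pr_b, pr_a_and_b, tol=0.005):
    """True if Pr(A and B) is within `tol` of Pr(A) * Pr(B)."""
    return abs(pr_a_and_b - pr_a * pr_b) <= tol


pr_diabetes = 0.0404              # marginal, from the notes
pr_diabetes_and_minimal = 0.0292  # joint, from the notes
pr_minimal_assumed = 0.40         # ASSUMED; not quoted in the notes

# Joint probability we would expect if the two events were independent.
expected_if_independent = pr_diabetes * pr_minimal_assumed

independent = looks_independent(pr_diabetes, pr_minimal_assumed,
                                pr_diabetes_and_minimal)
```

Under the assumed marginal, the expected joint probability is about 0.016, well below the observed 0.0292, so the events would not look independent. Whether such a gap is more than sampling noise is a question for a proper statistical test.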
Example Evaluation for Independence
- Job search program: employment success versus program participation.
- Raw employment data show stark differences based on program participation.
- Independence is checked using conditional probabilities:
- Employment probabilities differ clearly between participants and non-participants, indicating dependence.
- Program Evaluation Application:
- Dependence analysis is at the heart of evaluating a program's efficacy on targeted outcomes (e.g., employment rates, health outcomes).
- Statistical tests are important for confirming whether the probabilities are genuinely similar or different.
Conclusion
- This topic emphasizes categorical data analysis and visualization, and the critical role of probability in establishing relationships between variables and in identifying dependence or independence in data sets.