Notes on Analysing Categorical Data

Numerical (Quantitative) Data:
- Takes numeric values.
- Examples:
  - Number of people in a household.
  - Rate of unemployment.
Categorical (Qualitative) Data:
- Classified into distinct categories.
- Examples:
  - Industry type: manufacturing, construction.
  - Gender: male, female.
  - Educational level: high school, bachelor.

Example: Data on 5,000 Australians regarding medical conditions and exercise habits.
Key Variables:
- Medical Conditions:
  - Asthma, Cancer, Depression, Diabetes, Heart Disease, None of Above.
- Exercise Habits:
  - Moderate and Minimal.

Frequency Distributions:
- Create using pivot tables in Excel.
- Example: 202 individuals diagnosed with Diabetes among 5,000 respondents.
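
The same kind of tally can be produced outside Excel. The following is a minimal Python sketch, assuming each respondent's condition is stored as one string in a list. The small `responses` sample and the function name `frequency_distribution` are invented for illustration and are not the survey data.

```python
from collections import Counter

# Hypothetical mini-sample; the real survey has 5,000 respondents.
responses = [
    "None of Above", "Diabetes", "Asthma", "None of Above",
    "Depression", "Diabetes", "None of Above", "Heart Disease",
]

def frequency_distribution(values):
    """Count how many times each category occurs."""
    return Counter(values)

freq = frequency_distribution(responses)

# Print categories from most to least frequent.
for category, count in freq.most_common():
    print(f"{category:15s} {count}")
```

A `Counter` plays the role of a one-variable pivot table here: one row per category, one count per row.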

Bar Charts:
- The most common method for presenting categorical data.
- Advantages:
  - Easy to read and interpret.
- Bars must not be joined by lines, because the categories have no natural order.
- Variations:
- Order of Categories: e.g., alphabetical or from most to least frequent (Pareto chart).
- Using Percentages:
  - Present distributions as percentages rather than counts so that category shares are easy to compare.
- Pie Charts:
  - Visually appealing, but harder to read accurately than bar charts.
- Omitting Certain Categories:
  - A dominant category (e.g., ‘None of Above’) can dwarf the others and obscure differences among them, so it may be left out of the chart.
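
The variations above (ordering by frequency, using percentages, omitting a category) can be sketched in a few lines of Python. This is an illustration under stated assumptions: the counts are made up, the helper `to_percentages` is hypothetical, and percentages are recomputed over the categories that remain after omission.

```python
from collections import Counter

# Hypothetical counts, invented for illustration only.
counts = Counter({
    "None of Above": 60, "Asthma": 15, "Depression": 12,
    "Diabetes": 8, "Heart Disease": 3, "Cancer": 2,
})

def to_percentages(counts, omit=()):
    """Return (category, percent) pairs, most frequent first.

    Categories listed in `omit` are dropped before percentages are
    computed, so the remaining shares sum to 100.
    """
    kept = {k: v for k, v in counts.items() if k not in omit}
    total = sum(kept.values())
    ordered = sorted(kept.items(), key=lambda kv: kv[1], reverse=True)
    return [(k, 100 * v / total) for k, v in ordered]

all_categories = to_percentages(counts)
conditions_only = to_percentages(counts, omit={"None of Above"})

for category, pct in conditions_only:
    # A crude text bar: one '#' per two percentage points.
    print(f"{category:15s} {pct:5.1f}% {'#' * round(pct / 2)}")
```

With ‘None of Above’ included it takes 60% of the chart. Once it is omitted, the differences among the actual conditions become visible.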

Univariate vs. Bivariate Data:
- Univariate looks at one characteristic; bivariate analyzes two.
Contingency Tables (Cross Tabulation):
- Useful for summarizing the relationship between two variables (e.g., medical conditions and exercise habits).
- Help identify potential relationships (e.g., higher diabetes rates among minimal exercisers).
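
A contingency table is simply a count of every (row category, column category) pair. The sketch below builds one in plain Python. The `records` list is invented for illustration, and `cross_tabulate` is a hypothetical helper, not a library function.

```python
from collections import Counter

# Hypothetical (condition, exercise) pairs, invented for illustration.
records = [
    ("Diabetes", "Minimal"), ("Diabetes", "Minimal"), ("Diabetes", "Moderate"),
    ("Asthma", "Moderate"), ("Asthma", "Minimal"),
    ("None of Above", "Moderate"), ("None of Above", "Moderate"),
    ("None of Above", "Minimal"),
]

def cross_tabulate(pairs):
    """Build a nested dict table[row][column] of joint counts."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    # Counter returns 0 for pairs that never occur, so every cell is filled.
    return {r: {c: counts[(r, c)] for c in cols} for r in rows}

table = cross_tabulate(records)
for row, cells in table.items():
    print(f"{row:15s}", cells)
```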

Displaying Data:
- Use pivot tables to analyze relationships between categories.
- Present data as percentages for easier comparison (with medical conditions as rows and exercise habits as columns):
  - % of Row: Shows how each medical condition is split across exercise habits.
  - % of Column: Shows the prevalence of each condition within an exercise group, which accounts for the groups differing in size.
- Relevant Findings:
  - Lack of exercise is associated with a higher likelihood of diabetes.

Descriptive analysis can be supplemented with probability concepts.
Three formats for presenting bivariate tables:
1. % of Total: Each cell as a share of all observations; these shares correspond to joint probabilities.
2. % of Column: Each cell as a share of its column total; conditions on the “cause” variable to analyze its impact.
3. % of Row: Each cell as a share of its row total; conditions on the “effect” variable to analyze outcomes.
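
The three formats differ only in the denominator used for each cell. A minimal Python sketch follows, assuming conditions are rows and exercise habits are columns. The counts are made up, and the three function names are hypothetical.

```python
# Hypothetical counts: rows are conditions ("effect"), columns are
# exercise habits ("cause"). Invented for illustration only.
table = {
    "Diabetes":      {"Minimal": 30, "Moderate": 10},
    "None of Above": {"Minimal": 70, "Moderate": 90},
}

def percent_of_total(table):
    """Each cell divided by the grand total."""
    total = sum(sum(cells.values()) for cells in table.values())
    return {r: {c: 100 * v / total for c, v in cells.items()}
            for r, cells in table.items()}

def percent_of_row(table):
    """Each cell divided by its row total."""
    return {r: {c: 100 * v / sum(cells.values()) for c, v in cells.items()}
            for r, cells in table.items()}

def percent_of_column(table):
    """Each cell divided by its column total."""
    col_totals = {}
    for cells in table.values():
        for c, v in cells.items():
            col_totals[c] = col_totals.get(c, 0) + v
    return {r: {c: 100 * v / col_totals[c] for c, v in cells.items()}
            for r, cells in table.items()}

by_total = percent_of_total(table)
by_row = percent_of_row(table)
by_col = percent_of_column(table)
print(by_col["Diabetes"])
```

With these made-up counts, % of Column gives 30% diabetes among minimal exercisers against 10% among moderate exercisers, which is the comparison that speaks to the effect of exercise.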

Marginal Probabilities:
- Focus on one characteristic at a time:
  - Example: Pr(Diabetes) = 202/5,000 = 0.0404
Joint Probabilities:
- Probability that two events co-occur:
  - Example: Pr(Diabetes ∩ Minimal Exercise) = 0.0292
Conditional Probability:
- Probability of an event given a specific condition or pre-selected category:
  - Example: Pr(Diabetes | Minimal Exercise) = Pr(Diabetes ∩ Minimal Exercise) / Pr(Minimal Exercise), i.e., the joint probability divided by the marginal probability of the condition.
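
The arithmetic behind these three probabilities can be sketched in Python. The marginal and joint values come from the notes. The notes do not report Pr(Minimal Exercise), so the value 0.40 below is a placeholder assumption used only to make the example run. The helper names are hypothetical.

```python
def marginal(count, total):
    """Pr(A): share of all respondents with characteristic A."""
    return count / total

def conditional(joint, marginal_of_condition):
    """Pr(A | B) = Pr(A and B) / Pr(B)."""
    if marginal_of_condition <= 0:
        raise ValueError("Pr(B) must be positive")
    return joint / marginal_of_condition

n = 5000
pr_diabetes = marginal(202, n)      # 0.0404, as in the notes
pr_diabetes_and_minimal = 0.0292    # joint probability from the notes

# ASSUMPTION: Pr(Minimal Exercise) is not given in the notes; 0.40 is a
# placeholder, not a survey result.
pr_minimal = 0.40
pr_diabetes_given_minimal = conditional(pr_diabetes_and_minimal, pr_minimal)

print(round(pr_diabetes, 4), round(pr_diabetes_given_minimal, 4))
```

Under the assumed Pr(Minimal Exercise) = 0.40, the conditional probability would be 0.073. The real figure depends on the actual share of minimal exercisers in the survey.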

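Two events are independent if the occurrence of one does not affect the likelihood of the other, i.e., Pr(A | B) = Pr(A), or equivalently Pr(A ∩ B) = Pr(A) × Pr(B).
Example: compare the empirical Pr(Diabetes | Minimal Exercise) with Pr(Diabetes) from the tables above; if the two differ, diabetes and exercise habits are not independent.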
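
One way to check independence numerically is the product rule: the joint probability should equal the product of the two marginals. The sketch below is illustrative only. `is_independent` is a hypothetical helper, the tolerance is an arbitrary choice, and Pr(Minimal Exercise) = 0.40 is an assumed value that the notes do not report.

```python
def is_independent(pr_a, pr_b, pr_a_and_b, tol=1e-9):
    """Return True when Pr(A and B) equals Pr(A) * Pr(B) within `tol`."""
    return abs(pr_a_and_b - pr_a * pr_b) <= tol

# Made-up probabilities where the product rule holds exactly.
independent_case = is_independent(pr_a=0.5, pr_b=0.4, pr_a_and_b=0.2)

# Figures from the notes, plus an ASSUMED Pr(Minimal Exercise) = 0.40,
# which the notes do not report.
diabetes_case = is_independent(pr_a=0.0404, pr_b=0.40, pr_a_and_b=0.0292)

print(independent_case, diabetes_case)
```

This prints `True False`. With sample data the two sides almost never match exactly, so a fixed tolerance is only a rough screen. Deciding whether a gap is meaningful requires a statistical test.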

Job Search Program: Success versus Participation:
- Raw employment data show stark differences by program participation.
- Independence checked using conditional probabilities:
  - The probability of employment differs clearly between participants and non-participants, indicating dependence.
Program Evaluation Application:
- Dependence analysis is at the heart of evaluating the efficacy of a program on targeted outcomes (employment rates, health outcomes).
- Statistical tests are needed to judge whether observed differences in probabilities are genuine or could have arisen by chance.
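
The notes do not name a specific test. One common choice for a two-way table is Pearson's chi-square test of independence, sketched here in plain Python as an assumption about what such a test might look like. The employment counts are invented, and `chi_square_statistic` is a hypothetical helper.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows = participated / did not participate,
# columns = employed / not employed. Invented for illustration.
observed = [[60, 40],
            [40, 60]]

pr_employed_given_participated = observed[0][0] / sum(observed[0])
pr_employed_given_not = observed[1][0] / sum(observed[1])
stat = chi_square_statistic(observed)

# 3.841 is the 5% critical value of chi-square with 1 degree of freedom.
reject_independence = stat > 3.841
print(pr_employed_given_participated, pr_employed_given_not,
      round(stat, 2), reject_independence)
```

In this made-up table the employment probabilities are 0.6 and 0.4, and the statistic of 8.0 exceeds the critical value. A gap of that size in a sample of 200 would therefore be unlikely under independence.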

These notes emphasize categorical data analysis and visualization, and the critical role of probability in establishing relationships between variables and in identifying dependence or independence in data sets.