Chapters 2 & 3 Notes: Data and Categorical Data

Data are a collection of numbers, labels, or symbols, together with the context that gives those values meaning.
A data table is a rectangular arrangement of data organized with rows and columns.
Observations or cases form the rows.
Common attributes or characteristics form the columns, referred to as variables.
Organize data to yield meaningful information.
Provide context, including who, what, and when.
Improve interpretability with meaningful names, formatting, and units.
Rows: Describe a collection of items (such as people) and are called observations or cases.
Columns: Describe common attributes shared by these items and are called variables.
The number of rows is denoted by $n$.
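
A minimal sketch of these ideas in Python, assuming a small made-up table of bike purchases (the values and column names are hypothetical, chosen only to echo the examples in these notes). Each dictionary is one observation (row), each key is a variable (column), and `n` is the number of rows.

```python
# Hypothetical data table: each dict is one observation (row),
# and each key is a variable (column).
bike_sales = [
    {"customer": "C1", "brand": "Brand X", "size_cm": 54, "amount_usd": 850.0},
    {"customer": "C2", "brand": "Brand Y", "size_cm": 56, "amount_usd": 1200.0},
    {"customer": "C3", "brand": "Brand X", "size_cm": 52, "amount_usd": 640.0},
]

n = len(bike_sales)              # number of rows (observations)
variables = list(bike_sales[0])  # column names (variables)

print("n =", n)
print("variables:", variables)
```

Here `brand` is categorical, while `size_cm` and `amount_usd` are numerical and carry their units in the column name.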

The classical system refines variables into four categories: nominal, ordinal, interval, and ratio.
- Categorical: Nominal and ordinal variables.
- Numerical: Interval and ratio variables.

Categorical Data
- Also called qualitative variables.
- Identify group membership.
- Examples: Type of purchase, Brand of bike.
- Other examples: race, sex, age group, and educational level. Some of these (such as age) can be recorded as numbers but are often more informative when grouped into categories.
Numerical Data
- Also called quantitative variables (sometimes loosely called continuous, although numerical variables can also be discrete counts).
- Describe numerical properties of cases.
- Have measurement units.
- Examples: Size of bike (cm), Amount spent ($).

Nominal: Name categories without implying order (categorical).
Ordinal: Name categories that can be ordered (categorical).
Interval: Numerical values that can be added or subtracted (no absolute zero).
Ratio: Numerical values that can be added, subtracted, multiplied, or divided (makes ratio comparisons possible).

Nominal: Categories do not follow a rank or order.
- Examples: Marital Status, Gender.
Ordinal: Categories follow a rank or order.
- Examples: Level of Education, Position.
Example: Customer survey asking "How would you rate our service?" with options like 5 = Great, 4 = Very good, etc.
Example: The brand of gasoline bought (nominal) vs. the grade of gasoline (ordinal).
Likert Scale (Ordinal - typically 5 to 7 categories).
- Example question: "This MP3 player has all of the features that I want," with options from Strongly disagree to Strongly agree, rated on a scale of 1-7.

Interval: Differences between observations are meaningful, but ratios are not, because zero is an arbitrary reference point rather than the absence of the quantity.
- Examples: Credit score, temperature (Fahrenheit).
Ratio: Ratios between observations are meaningful because zero is a true zero that marks the absence of the quantity.
- Examples: Income, Age.
Interval variables allow addition and subtraction.
Ratio variables also allow multiplication and division.
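
The difference between interval and ratio scales can be checked with a little arithmetic. The sketch below uses made-up temperatures and incomes (both hypothetical): the ratio of two Fahrenheit readings changes when the same temperatures are expressed in Celsius, while the ratio of two incomes stays the same when the units change.

```python
def fahrenheit_to_celsius(f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9

# Interval scale: the zero point is arbitrary, so ratios depend on the units.
warm_f, cool_f = 80.0, 40.0
warm_c = fahrenheit_to_celsius(warm_f)
cool_c = fahrenheit_to_celsius(cool_f)

ratio_f = warm_f / cool_f  # 2.0 in Fahrenheit
ratio_c = warm_c / cool_c  # about 6.0 in Celsius, same two temperatures

# Ratio scale: a true zero, so ratios survive a change of units.
income_a, income_b = 80_000.0, 40_000.0
ratio_dollars = income_a / income_b
ratio_thousands = (income_a / 1000) / (income_b / 1000)

print(ratio_f, round(ratio_c, 2))
print(ratio_dollars, ratio_thousands)
```

Saying that 80 degrees is "twice as hot" as 40 degrees is therefore not meaningful, whereas saying that one income is twice another is.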

Recode: Building a new variable from another (e.g., recoding price into expensive or inexpensive).
Aggregate: Reduce rows in a data table by counting or summing values within categories.

Recoding rearranges data into a more convenient table.
Example: Recoding clothing purchases by size (Small, Medium, Large) instead of detailed descriptions.
Recoding produces a new column that consolidates labels to reduce the number of categories.
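
As a rough sketch of recoding, the snippet below builds a new column from an existing one, following the price example mentioned earlier. The rows, the `recode_price` helper, and the cutoff of 50 are all hypothetical choices made for illustration.

```python
CUTOFF = 50.0  # assumed threshold separating the two price groups

def recode_price(price, cutoff=CUTOFF):
    """Collapse a numerical price into one of two labels."""
    return "expensive" if price >= cutoff else "inexpensive"

# Hypothetical rows of a data table.
clothing = [
    {"item": "jacket", "price": 120.0},
    {"item": "t-shirt", "price": 15.0},
    {"item": "jeans", "price": 50.0},
]

# Recoding adds a column; the number of rows is unchanged.
for row in clothing:
    row["price_group"] = recode_price(row["price"])

for row in clothing:
    print(row)
```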

Aggregation summarizes data into a smaller table by summing values or counting cases within categories.
Aggregation generates a new data table with fewer rows, while recoding adds more columns.
Example: Aggregating daily purchases to count the number of purchases and sum the total value of purchases made each day.
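
The daily-purchases example might look like the following sketch, which assumes a hypothetical list of transactions with `day` and `amount` fields. Six rows of purchases are reduced to three rows, one per day, each holding a count and a total.

```python
from collections import defaultdict

# Hypothetical purchase-level data: one row per purchase.
transactions = [
    {"day": "Mon", "amount": 20.0},
    {"day": "Mon", "amount": 35.5},
    {"day": "Tue", "amount": 12.0},
    {"day": "Wed", "amount": 8.0},
    {"day": "Wed", "amount": 42.0},
    {"day": "Wed", "amount": 10.0},
]

# Count purchases and sum their value within each day.
by_day = defaultdict(lambda: {"count": 0, "total": 0.0})
for t in transactions:
    by_day[t["day"]]["count"] += 1
    by_day[t["day"]]["total"] += t["amount"]

# The aggregated table has one row per day instead of one per purchase.
aggregated = [{"day": day, **summary} for day, summary in by_day.items()]

for row in aggregated:
    print(row)
```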

Frequency and Relative Frequency Tables
- The distribution of a categorical variable is the list of its values together with their associated counts (frequencies).
- A frequency table summarizes the distribution of a categorical variable.
- A relative frequency table shows the proportion (or percentage) in each category.
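
A frequency table and a relative frequency table can be sketched with Python's `collections.Counter`. The list of brands below is made up for illustration.

```python
from collections import Counter

# Hypothetical categorical variable: brand chosen by each of 10 customers.
brands = ["A", "B", "A", "C", "A", "B", "A", "C", "B", "A"]

freq = Counter(brands)  # frequency table: category -> count
total = len(brands)
rel_freq = {cat: count / total for cat, count in freq.items()}  # proportions

print(f"{'Brand':<6}{'Count':>6}{'Share':>8}")
for cat in sorted(freq):
    print(f"{cat:<6}{freq[cat]:>6}{rel_freq[cat]:>8.0%}")
```

The counts add up to the number of cases, and the relative frequencies add up to 1 (or 100%).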

Bar Charts and Pie Charts
- Charts are better than tables for summarizing more than five categories (unless exact counts are needed).
- The two most common displays of a categorical variable are a bar chart and a pie chart.
Bar Chart
- Uses horizontal or vertical bars to show the distribution of a categorical variable.
- A Pareto chart is a bar chart whose categories are sorted by frequency, from most to least common (popular in quality control).
- Can become cluttered with too many categories.
- Appropriate for ordinal categorical variables.
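
To illustrate the Pareto ordering, the sketch below sorts categories by descending count and prints a simple text bar chart. The defect data are hypothetical, and the text bars stand in for a real plotting library.

```python
from collections import Counter

# Hypothetical quality-control data: type of defect found on each flawed item.
defects = ["scratch", "dent", "scratch", "misprint", "scratch", "dent", "scratch"]

counts = Counter(defects)
pareto = counts.most_common()  # (category, count) pairs, largest count first

# Each bar starts at zero and its length is proportional to the count.
for category, count in pareto:
    print(f"{category:<10}{'#' * count} {count}")
```

For an ordinal variable the natural category order would be kept instead of sorting by count.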
Pie Chart
- Shows the distribution of a categorical variable as wedges of a circle.
- Less useful than bar charts for comparing actual counts.
- Good for showing how the whole divides into shares.

The Fundamental Rule for Data Displays
- The area principle: the area occupied by a part of the graph/chart that displays data should be proportional to the amount of data it represents.
- Charts decorated to attract attention often violate the area principle.
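
One common way decorated charts break this rule is by scaling both the height and the width of a picture with the value. The arithmetic sketch below, with hypothetical helper functions and values, compares a plain bar to such an icon.

```python
def bar_area(value, width=1.0):
    """Plain bar: fixed width, height proportional to the value."""
    return width * value

def scaled_icon_area(value, aspect=1.0):
    """Decorative icon whose height AND width both grow with the value."""
    height = value
    width = aspect * value
    return width * height

small, large = 10.0, 20.0  # the larger value is twice the smaller

bar_ratio = bar_area(large) / bar_area(small)                   # area doubles
icon_ratio = scaled_icon_area(large) / scaled_icon_area(small)  # area quadruples

print(bar_ratio, icon_ratio)
```

The bar keeps area proportional to the data, while the icon makes a value that is twice as large look four times as large.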

Example: Market share competition among Apple, Google, and Research in Motion (RIM).
The example shows changes in smartphone sales to businesses from 2010 to 2011.
BlackBerry sales grew less than sales of iPhones and Android phones from 2010 to 2011.

Mode
- Category with the highest frequency.
- The longest bar in a bar chart.
- The widest slice in a pie chart.
- Two or more categories can tie for the highest frequency (bimodal or multimodal).
Median
- Not appropriate for nominal data.
- Data must be ordinal.
- The category label of the middle observation in ordered data.

Use a bar chart to show the frequencies of a categorical variable.
Use a pie chart to show the proportions of a categorical variable.
Keep the baseline of a bar chart at zero.
Preserve the ordering of an ordinal variable.
Respect the area principle.
Show the best plots to answer the motivating question.
Label your chart to show the categories and indicate whether some have been combined or omitted.