Chapters 2 & 3 Notes: Data and Categorical Data
2.1 Data Tables
- Data are a collection of numbers, labels, or symbols, together with the context that gives them meaning.
- A data table is a rectangular arrangement of data organized with rows and columns.
- Observations or cases form the rows.
- Common attributes or characteristics form the columns, referred to as variables.
- Organize data to yield meaningful information.
- Provide context, including who, what, and when.
- Improve interpretability with meaningful names, formatting, and units.
- Rows: Describe a collection of items (such as people) and are called observations or cases.
- Columns: Describe common attributes shared by these items and are called variables.
- The number of rows is denoted by n.
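A minimal sketch of a data table in plain Python, assuming a small set of bike sales. The brand labels, sizes, and amounts are invented for illustration and do not come from the notes.

```python
# Each dict is one row (an observation or case); each key is a column (a variable).
# All values below are hypothetical.
bike_sales = [
    {"brand": "BrandA", "size_cm": 54, "amount_usd": 650.0},
    {"brand": "BrandB", "size_cm": 58, "amount_usd": 1200.0},
    {"brand": "BrandA", "size_cm": 52, "amount_usd": 480.0},
]

n = len(bike_sales)              # number of rows (observations)
variables = list(bike_sales[0])  # column names (variables)

print("n =", n)
print("variables:", variables)
```

Putting the units in the column names (cm, usd) is one way to keep the table interpretable.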
2.2 Types of Variables
- The classical system divides variables into four categories: nominal, ordinal, interval, and ratio.
- Categorical: Nominal and ordinal variables.
- Numerical: Interval and ratio variables.
2.2 Categorical and Numerical Data
- Categorical Data
- Also called qualitative variables.
- Identify group membership.
- Examples: Type of purchase, Brand of bike.
- Other examples: race, sex, age group, and educational level. Some of these (such as age) can be recorded as numbers but are often more informative when grouped into categories.
- Numerical Data
- Also called quantitative variables (sometimes loosely called continuous variables, although numerical data can also be discrete counts).
- Describe numerical properties of cases.
- Have measurement units.
- Examples: Size of bike (cm), Amount spent ($).
2.2 Measurement Scales
- Nominal: Name categories without implying order (categorical).
- Ordinal: Name categories that can be ordered (categorical).
- Interval: Numerical values that can be added or subtracted (no absolute zero).
- Ratio: Numerical values that can be added, subtracted, multiplied, or divided (makes ratio comparisons possible).
Categorical Data
- Nominal: Categories do not follow a rank or order.
- Examples: Marital Status, Gender.
- Ordinal: Categories follow a rank or order.
- Examples: Level of Education, Position.
- Example: Customer survey asking "How would you rate our service?" with options like 5 = Great, 4 = Very good, etc.
- Example: The brand of gasoline bought (nominal) vs. the grade of gasoline (ordinal).
- Likert Scale (Ordinal - typically 5 to 7 categories).
- Example question: "This MP3 player has all of the features that I want," with options from Strongly disagree to Strongly agree, rated on a scale of 1-7.
Numerical Data
- Interval: Differences between observations are meaningful, but ratios are not; zero is an arbitrary reference point rather than the absence of the quantity.
- Examples: Credit score, temperature (Fahrenheit).
- Ratio: Ratios between observations are meaningful; zero means that none of the quantity is present.
- Examples: Amount spent ($), size of bike (cm).
- Interval variables allow addition and subtraction.
- Ratio variables also allow multiplication and division.
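A short numerical sketch of why ratios are not meaningful on an interval scale, using the Fahrenheit example. The specific temperatures and dollar amounts are invented for illustration.

```python
def fahrenheit_to_celsius(deg_f):
    """Convert a Fahrenheit temperature to Celsius."""
    return (deg_f - 32) * 5 / 9

# Interval scale: the ratio of two temperatures depends on the unit chosen.
warm_f, cool_f = 80.0, 40.0
ratio_in_f = warm_f / cool_f  # 2.0 in Fahrenheit
ratio_in_c = fahrenheit_to_celsius(warm_f) / fahrenheit_to_celsius(cool_f)  # about 6 in Celsius

# Differences survive the unit change: comparing two gaps gives the same answer.
gap_ratio_f = (80.0 - 40.0) / (60.0 - 40.0)
gap_ratio_c = (fahrenheit_to_celsius(80.0) - fahrenheit_to_celsius(40.0)) / (
    fahrenheit_to_celsius(60.0) - fahrenheit_to_celsius(40.0)
)

# Ratio scale: an amount spent has a true zero, so the ratio is the same in any unit.
ratio_in_dollars = 50.0 / 25.0
ratio_in_cents = 5000.0 / 2500.0

print(ratio_in_f, ratio_in_c)
print(gap_ratio_f, gap_ratio_c)
print(ratio_in_dollars, ratio_in_cents)
```

Because "twice as hot" changes with the unit while "twice as much money" does not, only the second statement is a valid ratio comparison.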
2.3 Recoding and Aggregation
- Recode: Building a new variable from another (e.g., recoding price into expensive or inexpensive).
- Aggregate: Reduce rows in a data table by counting or summing values within categories.
2.3 Recoding
- Involves rearranging data into a more convenient table.
- Example: Recoding clothing purchases by size (Small, Medium, Large) instead of detailed descriptions.
- Recoding produces a new column that consolidates labels to reduce the number of categories.
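A minimal sketch of recoding, assuming detailed clothing descriptions are consolidated into three size labels. The descriptions and the mapping `SIZE_MAP` are hypothetical.

```python
# Hypothetical mapping from detailed descriptions to consolidated size labels.
SIZE_MAP = {
    "Mens S": "Small",
    "Womens Petite S": "Small",
    "Mens M": "Medium",
    "Womens M Tall": "Medium",
    "Mens L": "Large",
}

# Hypothetical purchase rows with one detailed description each.
purchases = [
    {"description": "Mens S"},
    {"description": "Womens M Tall"},
    {"description": "Womens Petite S"},
    {"description": "Mens L"},
]

# Recoding adds a new column ("size"); the number of rows does not change.
for row in purchases:
    row["size"] = SIZE_MAP[row["description"]]

print([row["size"] for row in purchases])
```

The original column is kept, so the detailed descriptions remain available if they are needed later.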
2.3 Aggregation
- Summarizes data into a smaller table by summing values or counting cases within categories.
- Aggregation generates a new data table with fewer rows, while recoding adds a new column to the existing table and leaves the number of rows unchanged.
- Example: Aggregating daily purchases to count the number of purchases and sum the total value of purchases made each day.
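The daily-purchases example can be sketched as follows. The dates and amounts are invented, and the output column names `n_purchases` and `total_usd` are illustrative choices.

```python
from collections import defaultdict

# Hypothetical purchase-level table: one row per purchase.
purchases = [
    {"date": "2024-03-01", "amount_usd": 20.0},
    {"date": "2024-03-01", "amount_usd": 35.5},
    {"date": "2024-03-02", "amount_usd": 12.0},
    {"date": "2024-03-02", "amount_usd": 8.0},
    {"date": "2024-03-02", "amount_usd": 40.0},
]

# Count the purchases and sum their value within each day.
daily = defaultdict(lambda: {"n_purchases": 0, "total_usd": 0.0})
for row in purchases:
    day = daily[row["date"]]
    day["n_purchases"] += 1
    day["total_usd"] += row["amount_usd"]

# The aggregated table has one row per day, so it has fewer rows than the input.
daily_table = [{"date": d, **stats} for d, stats in sorted(daily.items())]

for row in daily_table:
    print(row)
```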
2.4 Time Series
- Time series: Data recorded over time.
- Rows measure the variable over and over at different points in time.
- Example: Recording the price of Microsoft's stock each day.
- Timeplot: Graph of a time series showing values in chronological order.
- Frequency: Regular time spacing of data (daily, monthly, etc.).
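A small sketch of preparing a time series for a timeplot: put the rows in chronological order and check whether the spacing is regular. The dates and prices are invented and are not actual stock prices.

```python
from datetime import date

# Hypothetical daily closing prices, recorded out of order.
prices = [
    (date(2024, 1, 3), 101.5),
    (date(2024, 1, 1), 100.0),
    (date(2024, 1, 2), 100.8),
]

# A timeplot shows values in chronological order, so sort by date first.
series = sorted(prices)

# The frequency is regular if every gap between consecutive dates is the same.
gaps_in_days = {(later[0] - earlier[0]).days for earlier, later in zip(series, series[1:])}
is_regular = len(gaps_in_days) == 1

print(series)
print("gaps (days):", gaps_in_days, "regular:", is_regular)
```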
2.4 Cross-Sectional Data
- Data observed at one point in time.
- Example: Retail sales at Wal-Mart stores across the U.S. in March 2011.
Chapter 3 Describing Categorical Data
- Focus on describing the variation in categorical variables.
3.1 Looking At Data
- Frequency and Relative Frequency Tables
- The distribution of a categorical variable is the list of its values, each with its associated count (frequency).
- A frequency table summarizes the distribution of a categorical variable.
- A relative frequency table shows the proportion (or percentage) in each category.
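A minimal sketch of a frequency table and a relative frequency table, assuming a hypothetical column of brand labels.

```python
from collections import Counter

# Hypothetical categorical variable: one brand label per observation.
brands = ["A", "B", "A", "C", "A", "B", "A", "C"]

n = len(brands)
frequency = Counter(brands)  # count in each category
relative_frequency = {cat: count / n for cat, count in frequency.items()}  # proportion in each category

for cat in sorted(frequency):
    print(cat, frequency[cat], relative_frequency[cat])
```

The counts add up to n, and the proportions add up to 1.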
3.2 Charts of Categorical Data
- Bar Charts and Pie Charts
- Charts are better than tables for summarizing more than five categories (unless exact counts are needed).
- The two most common displays of a categorical variable are a bar chart and a pie chart.
- Bar Chart
- Uses horizontal or vertical bars to show the distribution of a categorical variable.
- A Pareto chart sorts categories by frequency (popular in quality control).
- Can become cluttered with too many categories.
- Appropriate for ordinal categorical variables, since the bars can be arranged in the natural order of the categories.
- Pie Chart
- Shows the distribution of a categorical variable as wedges of a circle.
- Less useful than bar charts for comparing actual counts.
- Good for showing how the whole divides into shares.
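A rough sketch of the Pareto ordering behind a sorted bar chart, drawn as text bars, together with the shares a pie chart would display. The defect categories are invented for illustration.

```python
from collections import Counter

# Hypothetical quality-control log: one defect label per inspected item.
defects = ["scratch", "dent", "scratch", "misprint", "scratch", "dent"]

# Pareto ordering: categories sorted from most to least frequent.
pareto = Counter(defects).most_common()

# Text bar chart: bar length equals the count, so every bar starts at zero.
for category, count in pareto:
    print(f"{category:<10}{'#' * count} {count}")

# Shares of the whole, which is what the wedges of a pie chart represent.
total = len(defects)
shares = {category: count / total for category, count in pareto}
print(shares)
```

Sorting by frequency suits a nominal variable like this one; for an ordinal variable the natural category order would be kept instead.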
3.3 The Area Principle
- The Fundamental Rule for Data Displays
- The area occupied by a part of the graph/chart that displays data should be proportional to the amount of data it represents.
- Charts decorated to attract attention often violate the area principle.
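A small sketch of how decoration can break the area principle, assuming a chart that scales both the height and the width of a picture with the value. The function `drawn_area` and the two values are hypothetical.

```python
def drawn_area(value, base_width=1.0, scale_width_with_value=False):
    """Area of the shape drawn for one value.

    A plain bar keeps a fixed width, so its area is proportional to the value.
    A decorative picture that also widens with the value has area proportional
    to the value squared.
    """
    height = value
    width = base_width * value if scale_width_with_value else base_width
    return width * height

small, large = 10.0, 20.0  # the second value is twice the first

honest_ratio = drawn_area(large) / drawn_area(small)
distorted_ratio = (
    drawn_area(large, scale_width_with_value=True)
    / drawn_area(small, scale_width_with_value=True)
)

print(honest_ratio)     # area doubles when the value doubles
print(distorted_ratio)  # area quadruples when the value doubles
```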
Example 3.2: SELLING SMARTPHONES TO BUSINESSES
- Illustrates market share competition between Apple, Google, and Research in Motion (RIM).
- Shows changes in smartphone sales to businesses from 2010 to 2011.
- BlackBerry sales grew less than sales of iPhones and Android phones from 2010 to 2011.
- Mode
- Category with the highest frequency.
- The longest bar in a bar chart.
- The widest slice in a pie chart.
- Two or more categories can tie with the highest frequency (bimodal or multimodal).
- Median
- Not appropriate for nominal data.
- Data must be ordinal.
- The category label of the middle observation in ordered data.
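A minimal sketch of finding the mode and the median of an ordinal variable, assuming a five-level service rating. Only "Very good" and "Great" appear in the notes; the other labels and all responses are invented, and the sketch assumes an odd number of responses so that there is a single middle observation.

```python
from collections import Counter

# Ordered categories, lowest to highest (the lower three labels are hypothetical).
LEVELS = ["Poor", "Fair", "Good", "Very good", "Great"]

# Hypothetical survey responses (odd count, so there is one middle observation).
responses = ["Good", "Great", "Fair", "Very good", "Great", "Good", "Great"]

# Mode: every category that reaches the highest frequency (ties give several modes).
counts = Counter(responses)
highest = max(counts.values())
modes = [level for level in LEVELS if counts[level] == highest]

# Median: sort by category rank, then take the label of the middle observation.
ordered = sorted(responses, key=LEVELS.index)
median_label = ordered[len(ordered) // 2]

print("modes:", modes)
print("median:", median_label)
```

The sort relies on the order in `LEVELS`, which is why the median needs ordinal data: a nominal variable has no ranking to sort by.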
Best Practices
- Use a bar chart to show the frequencies of a categorical variable.
- Use a pie chart to show the proportions of a categorical variable.
- Keep the baseline of a bar chart at zero.
- Preserve the ordering of an ordinal variable.
- Respect the area principle.
- Show the best plots to answer the motivating question.
- Label your chart to show the categories and indicate whether some have been combined or omitted.