Chapters 2 & 3 Notes: Data and Categorical Data

2.1 Data Tables

  • Data are a collection of numbers, labels, or symbols, together with the context that gives those values meaning.
  • A data table is a rectangular arrangement of data organized with rows and columns.
  • Observations or cases form the rows.
  • Common attributes or characteristics form the columns, referred to as variables.
  • Organize data to yield meaningful information.
  • Provide context, including who, what, and when.
  • Improve interpretability with meaningful names, formatting, and units.
  • Rows: Describe a collection of items (people) and are called observations or cases.
  • Columns: Describe common attributes shared by these items and are called variables.
  • The number of rows is denoted by n.
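
As a rough illustration of the row and column layout, a data table can be sketched in Python as a list of records. The bike purchases, brand labels, and column names below are made up for illustration and do not come from the text.

```python
# Hypothetical data table: each dict is one row (an observation or case),
# and each key is a column (a variable).
purchases = [
    {"customer": "A", "brand": "Alpha", "frame_size_cm": 54, "amount_usd": 899.00},
    {"customer": "B", "brand": "Beta", "frame_size_cm": 56, "amount_usd": 1250.00},
    {"customer": "C", "brand": "Alpha", "frame_size_cm": 52, "amount_usd": 640.50},
]

n = len(purchases)              # number of rows (observations)
variables = list(purchases[0])  # column names (variables)

print(n)          # 3
print(variables)  # ['customer', 'brand', 'frame_size_cm', 'amount_usd']
```

Putting the units in the column names (cm, usd) is one simple way to follow the advice about meaningful names and units.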

2.2 Type of Variables

  • Classical system refines variables into four categories: nominal, ordinal, interval, and ratio.
    • Categorical: Nominal and ordinal variables.
    • Numerical: Interval and ratio variables.

2.2 Categorical and Numerical Data

  • Categorical Data
    • Also called qualitative variables.
    • Identify group membership.
    • Examples: Type of purchase, Brand of bike.
    • Other examples: race, sex, age group, and educational level. Some of these (such as age or years of education) can be recorded numerically but are often more informative when grouped into categories.
  • Numerical Data
    • Also called quantitative variables (and sometimes, loosely, continuous variables).
    • Describe numerical properties of cases.
    • Have measurement units.
    • Examples: Size of bike (cm), Amount spent ($).

2.2 Measurement Scales

  • Nominal: Name categories without implying order (categorical).
  • Ordinal: Name categories that can be ordered (categorical).
  • Interval: Numerical values that can be added or subtracted (no absolute zero).
  • Ratio: Numerical values that can be added, subtracted, multiplied, or divided (makes ratio comparisons possible).

Categorical Data

  • Nominal: Categories do not follow a rank or order.
    • Examples: Marital Status, Gender.
  • Ordinal: Categories follow a rank or order.
    • Examples: Level of Education, Position.
  • Example: Customer survey asking "How would you rate our service?" with options like 5 = Great, 4 = Very good, etc.
  • Example: The brand of gasoline bought (nominal) vs. the grade of gasoline (ordinal).
  • Likert Scale (Ordinal - typically 5 to 7 categories).
    • Example question: "This MP3 player has all of the features that I want," with options from Strongly disagree to Strongly agree, rated on a scale of 1-7.

Numerical Data

  • Interval: Differences between observations are meaningful, but ratios are not; zero is an arbitrary point on the scale rather than the absence of the quantity.
    • Examples: Credit score, temperature (Fahrenheit).
  • Ratio: Ratios between observations are meaningful; zero means that none of the quantity is present.
    • Examples: Income, Age.
  • Interval variables allow addition and subtraction.
  • Ratio variables also allow multiplication and division.
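
The difference between the two scales can be checked with a little arithmetic. The sketch below uses made-up temperatures and incomes: it converts Fahrenheit to Celsius to show that a ratio of interval values changes with the unit, while a ratio of incomes does not.

```python
def fahrenheit_to_celsius(f):
    """Convert a Fahrenheit temperature to Celsius."""
    return (f - 32) * 5 / 9

# Hypothetical temperatures (interval scale).
warm_f, cool_f = 80.0, 40.0
warm_c = fahrenheit_to_celsius(warm_f)
cool_c = fahrenheit_to_celsius(cool_f)

# Differences survive a change of units, up to the 5/9 scale factor.
diff_f = warm_f - cool_f   # 40 degrees F
diff_c = warm_c - cool_c   # 40 * 5/9, about 22.2 degrees C

# Ratios do not, because the zero point is arbitrary.
ratio_f = warm_f / cool_f  # 2.0
ratio_c = warm_c / cool_c  # about 6.0, so "twice as warm" is not meaningful

# Hypothetical incomes (ratio scale): the ratio is the same in any unit.
income_a, income_b = 80_000, 40_000
ratio_dollars = income_a / income_b                       # 2.0
ratio_thousands = (income_a / 1000) / (income_b / 1000)   # 2.0
```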

2.3 Recoding and Aggregation

  • Recode: Building a new variable from another (e.g., recoding price into expensive or inexpensive).
  • Aggregate: Reduce rows in a data table by counting or summing values within categories.

2.3 Recoding

  • Involves rearranging data into a more convenient table.
  • Example: Recoding clothing purchases by size (Small, Medium, Large) instead of detailed descriptions.
  • Recoding produces a new column that consolidates labels to reduce the number of categories.
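
A minimal sketch of recoding, using the price example from the definition of recoding. The items, the prices, and the $100 cutoff are assumptions chosen only for illustration.

```python
# Hypothetical items and prices.
rows = [
    {"item": "jacket", "price": 180.0},
    {"item": "t-shirt", "price": 15.0},
    {"item": "jeans", "price": 60.0},
    {"item": "boots", "price": 140.0},
]

def recode_price(price, cutoff=100.0):
    """Recode a numerical price into a two-level categorical label.

    The cutoff is an arbitrary choice for this example.
    """
    return "expensive" if price >= cutoff else "inexpensive"

# Recoding adds a column; the number of rows is unchanged.
for row in rows:
    row["price_level"] = recode_price(row["price"])

print([row["price_level"] for row in rows])
# ['expensive', 'inexpensive', 'inexpensive', 'expensive']
```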

2.3 Aggregation

  • Summarizes data into a smaller table by summing values or counting cases within categories.
  • Aggregation generates a new data table with fewer rows, while recoding adds a new column to the existing table.
  • Example: Aggregating daily purchases to count the number of purchases and sum the total value of purchases made each day.
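
The daily-purchases example can be sketched as follows. The purchase records are invented for illustration; the point is only that five input rows collapse into one output row per day.

```python
from collections import defaultdict

# Hypothetical purchase records: one row per purchase.
purchases = [
    {"day": "2011-03-01", "amount": 20.0},
    {"day": "2011-03-01", "amount": 35.5},
    {"day": "2011-03-02", "amount": 12.0},
    {"day": "2011-03-02", "amount": 8.0},
    {"day": "2011-03-02", "amount": 50.0},
]

# Aggregate: one output row per day, holding a count and a sum.
daily = defaultdict(lambda: {"count": 0, "total": 0.0})
for p in purchases:
    daily[p["day"]]["count"] += 1
    daily[p["day"]]["total"] += p["amount"]

# Build the smaller table, sorted by day.
daily_table = [{"day": day, **stats} for day, stats in sorted(daily.items())]

for row in daily_table:
    print(row["day"], row["count"], row["total"])
```

The aggregated table has two rows (one per day) instead of five, and the individual purchases can no longer be recovered from it.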

2.4 Time Series

  • Time series: Data recorded over time.
  • Rows measure the variable over and over at different points in time.
  • Example: Recording the price of Microsoft's stock each day.
  • Timeplot: Graph of a time series showing values in chronological order.
  • Frequency: Regular time spacing of data (daily, monthly, etc.).

2.4 Cross-Sectional Data

  • Data observed at one point in time.
  • Example: Retail sales at Wal-Mart stores across the U.S. in March 2011.

Chapter 3 Describing Categorical Data

  • Focus on describing the variation in categorical variables.

3.1 Looking At Data

  • Frequency and Relative Frequency Tables
    • The distribution of a categorical variable is the list of its values (categories), each with its associated count (frequency).
    • A frequency table summarizes the distribution of a categorical variable.
    • A relative frequency table shows the proportion (or percentage) in each category.
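
A frequency table and a relative frequency table can be built in a few lines. This is a sketch on made-up purchase-type labels, not data from the text.

```python
from collections import Counter

# Hypothetical purchase types, one label per observation.
purchase_type = ["online", "store", "online", "phone",
                 "online", "store", "online", "store"]

frequency = Counter(purchase_type)   # count per category
n = len(purchase_type)               # number of observations
relative_frequency = {cat: count / n for cat, count in frequency.items()}

# Print category, count, and percentage, most frequent first.
for cat, count in frequency.most_common():
    print(f"{cat:<8}{count:>3}{relative_frequency[cat]:>8.1%}")
```

The counts add up to n and the relative frequencies add up to 1, which is a quick check that no observation was dropped.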

3.2 Charts of Categorical Data

  • Bar Charts and Pie Charts
    • Charts are better than tables for summarizing more than five categories (unless exact counts are needed).
    • The two most common displays of a categorical variable are a bar chart and a pie chart.
  • Bar Chart
    • Uses horizontal or vertical bars to show the distribution of a categorical variable.
    • A Pareto chart sorts categories by frequency (popular in quality control).
    • Can become cluttered with too many categories.
    • Appropriate for ordinal categorical variables, provided the bars keep the categories in their natural order.
  • Pie Chart
    • Shows the distribution of a categorical variable as wedges of a circle.
    • Less useful than bar charts for comparing actual counts.
    • Good for showing how the whole divides into shares.
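
The Pareto ordering mentioned above amounts to sorting the frequency table from most to least frequent before drawing the bars. The sketch below uses an invented quality-control log and prints a plain-text bar chart rather than calling a plotting library.

```python
from collections import Counter

# Hypothetical defect categories from a quality-control log.
defects = ["scratch", "dent", "scratch", "misprint",
           "scratch", "dent", "scratch"]

counts = Counter(defects)

# Pareto ordering: categories sorted from most to least frequent.
pareto = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Text bar chart: bar length is proportional to the count, starting at zero.
for category, count in pareto:
    print(f"{category:<10}{'#' * count} {count}")
```

Sorting by frequency suits nominal categories like these; for an ordinal variable, the natural order of the categories would be kept instead.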

3.3 The Area Principle

  • The Fundamental Rule for Data Displays
    • The area occupied by a part of the graph/chart that displays data should be proportional to the amount of data it represents.
    • Charts decorated to attract attention often violate the area principle.
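
One common way decorated charts break the area principle is by scaling a picture in both width and height. The small sketch below, with hypothetical function names, compares that case with fixed-width bars.

```python
def pictograph_area_ratio(value_ratio):
    """Area ratio when a picture is scaled in both width and height by value_ratio."""
    return value_ratio ** 2

def bar_area_ratio(value_ratio):
    """Area ratio for fixed-width bars, where only the height scales."""
    return value_ratio

value_ratio = 2.0  # hypothetical: one category has twice the count of another

print(pictograph_area_ratio(value_ratio))  # 4.0
print(bar_area_ratio(value_ratio))         # 2.0
```

A count that is twice as large is drawn with four times the area in the scaled picture, but with exactly twice the area in the bar chart.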

Example 3.2: SELLING SMARTPHONES TO BUSINESSES

  • Illustrates market share competition between Apple, Google, and Research in Motion (RIM).
  • Shows changes in smartphone sales to businesses from 2010 to 2011.
  • BlackBerry sales grew less than sales of iPhones and Android phones from 2010 to 2011.

3.4 Mode and Median

  • Mode
    • Category with the highest frequency.
    • The longest bar in a bar chart.
    • The widest slice in a pie chart.
    • Two or more categories can tie with the highest frequency (bimodal or multimodal).
  • Median
    • Not appropriate for nominal data.
    • Data must be ordinal.
    • The category label of the middle observation in ordered data.
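
Both summaries can be computed directly from the definitions. The sketch below uses invented survey responses on a five-level ordinal scale; the scale labels and their order are assumptions for illustration.

```python
from collections import Counter

# Hypothetical ordinal scale, listed from lowest to highest.
scale = ["Poor", "Fair", "Good", "Very good", "Great"]

# Hypothetical survey responses.
responses = ["Great", "Great", "Great", "Good", "Very good",
             "Poor", "Fair", "Great", "Good"]

# Mode(s): every category tied for the highest frequency.
counts = Counter(responses)
top = max(counts.values())
modes = [cat for cat in scale if counts[cat] == top]

# Median: order the responses by position on the scale,
# then take the label of the middle observation.
ordered = sorted(responses, key=scale.index)
median = ordered[len(ordered) // 2]  # n is odd here

print(modes)   # ['Great']
print(median)  # Very good
```

With an even number of observations there are two middle observations, and the median is only unambiguous when both fall in the same category. The sort step is what requires ordinal data: nominal labels have no order to sort by.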

Best Practices

  • Use a bar chart to show the frequencies of a categorical variable.
  • Use a pie chart to show the proportions of a categorical variable.
  • Keep the baseline of a bar chart at zero.
  • Preserve the ordering of an ordinal variable.
  • Respect the area principle.
  • Show the best plots to answer the motivating question.
  • Label your chart to show the categories and indicate whether some have been combined or omitted.