Statistics Exam Review: Describing Data

Module Overview

  • Module 2: Describing Data
    • Topic 1: Categorical Variables
    • Topic 2: One Quantitative Variable: Shape and Center
    • Topic 3: One Quantitative Variable: Spread
    • Topic 4: Boxplots and Quantitative/Categorical Relationships
  • Covers Chapter 2 (Sections 1 – 4) in Lock5 Text

Review: Order of Operations

  • Method of Attack:
    • Ask a research question →
    • Design a study and collect data →
    • Organize data →
    • Visually explore data for patterns →
    • Summarize data numerically →
    • Draw inferences and formulate conclusions →
    • Look back and ahead

Review: Data

  • Definition:
    • Observations gathered for analysis
  • Dataset:
    • Consists of values for one or more variables measuring information for each case
  • Cases:
    • Subjects/objects of interest, also known as units or individuals
  • Variables:
    • Characteristics recorded for each case, corresponding to the columns in a data table

Review: Variable Types

  • Categorical Variable:
    • Divides cases into groups, placing each case into exactly one of two or more categories
  • Quantitative Variable:
    • Measures or records a numerical quantity for each case

Review: Populations and Samples

  • Population:
    • All individuals or objects of interest, often the group being studied
  • Sample:
    • A subset of the population selected to provide information about it
    • Used because accessing the entire population is often impractical

Visual Depiction with Descriptive Statistics

  • Sampling:
    • N: Population size (often unknown)
    • n: Sample size
  • Descriptive Statistics:
    • Use numerical and graphical methods to identify patterns, summarize information, and present data meaningfully

Next Steps

  • After Data Collection:
    • Visualize the data using graphical methods
    • Summarize the data using numerical methods
    • Visualization and summary statistics depend on the variable type(s) (categorical or quantitative)
    • Known as descriptive statistics or exploratory data analysis

Data Distribution

  • Definition:
    • The distribution of a variable describes which values it takes and how often it takes each of them
  • Visualization Importance:
    • Enables analysis of data variation for both categorical and quantitative variables

Example: One Categorical Variable - Cell Phones

  • Survey Context:
    • A random sample of US adults surveyed in 2012 regarding cell phone ownership
  • Response Categories:
    • Android, iPhone, Blackberry, Non-smartphone, No cell phone

Example: Cell Phone Type - Results

  • Frequency Table:
    • Shows the number of cases in each category
    • Sample size collected: n = 2,253
    • Type of Cell Phone Frequency Data:
    • Android: 458
    • iPhone: 437
    • Blackberry: 141
    • Non-Smartphone: 924
    • No Cell Phone: 293
    • Total: 2,253

From Frequency to Relative Frequency

  • Definitions:
    • Frequency: Number of times a value occurs in a dataset
    • Relative Frequency: Frequency divided by total observations (n), also a proportion between 0 and 1
    • Percentage: Relative frequency multiplied by 100
  • Example Calculation:
    • Frequency of 3 with total observations of 6
    • Relative frequency: $\frac{3}{6} = 0.5$
    • Percentage: 50%
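
The three definitions above chain together in a few lines. The following is a minimal sketch in Python; the function name `relative_frequency` is hypothetical and chosen only for illustration.

```python
def relative_frequency(frequency: int, n: int) -> float:
    """Return frequency / n, the proportion of observations with a given value."""
    if n <= 0:
        raise ValueError("n must be a positive number of observations")
    if not 0 <= frequency <= n:
        raise ValueError("frequency must be between 0 and n")
    return frequency / n


# Worked example from the notes: a value occurs 3 times among 6 observations.
rel_freq = relative_frequency(3, 6)
percentage = rel_freq * 100  # percentage = relative frequency x 100

print(rel_freq)    # 0.5
print(percentage)  # 50.0
```

The range checks reflect the definition: a relative frequency is always a proportion between 0 and 1.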

Measurement: Proportion

  • Calculation:
    • $\text{proportion} = \frac{\text{number in category}}{\text{total number of observations}}$
  • Sample Proportion:
    • Denoted as $\hat{p}$ when from a sample
  • Population Proportion:
    • Denoted as p when from a population

Example: Finding a Proportion of Cell Phone Non-ownership

  • Calculation:
    • Proportion of adults without a cell phone:
      $\hat{p} = \frac{293}{2253} \approx 0.13$
    • Percentage: 13%
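
The same calculation can be checked in Python. This is only a sketch using the counts from the frequency table; the variable names are illustrative.

```python
# Counts taken from the cell phone frequency table in these notes.
no_cell_phone = 293  # number of cases in the "No Cell Phone" category
n = 2253             # sample size

p_hat = no_cell_phone / n  # sample proportion

print(round(p_hat, 2))  # 0.13
print(f"{p_hat:.0%}")   # 13%
```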

Example: Relative Frequency Table

  • Description:
    • Displays proportion of cases in each category
    • Sum of relative frequencies equals 1
  • Relative Frequency Data:
    • Android: 0.203
    • iPhone: 0.194
    • Blackberry: 0.063
    • Non-Smartphone: 0.410
    • No Cell Phone: 0.130
    • Total: 1
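
As a rough illustration, the relative frequency table can be derived from the frequency table by dividing each count by n. The sketch below assumes the counts are stored in a plain Python dictionary, which is a choice made here for convenience.

```python
# Frequency table from the cell phone example.
frequencies = {
    "Android": 458,
    "iPhone": 437,
    "Blackberry": 141,
    "Non-Smartphone": 924,
    "No Cell Phone": 293,
}

n = sum(frequencies.values())  # total number of cases

# Relative frequency = frequency / n for each category.
relative = {category: count / n for category, count in frequencies.items()}

for category, proportion in relative.items():
    print(f"{category:<15}{proportion:.3f}")
print(f"{'Total':<15}{sum(relative.values()):.3f}")
```

Summing the unrounded relative frequencies gives 1 (up to floating-point error), which is a quick check that no category was dropped.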

Graphing One Categorical Variable

  • Bar Chart:
    • Categories represented by bars, height corresponds to frequency, relative frequency, or percentage
    • X-axis: Categories; Y-axis: Frequency, relative frequency, or percentage

Bar Chart Titling

  • Generic title:
    • "Distribution of 'Name of the Variable' for 'Number and describe the cases'"
  • X-axis Label:
    • Name of the Variable (e.g., "Cell phone Type")
  • Y-axis Label:
    • Number of 'Describe the cases' or Proportion

Good Practices when Constructing Bar Charts

  • Bar Chart Structure:
    • Equally spaced categories on the x-axis
    • Bars of uniform width
    • Meaningful labeling
  • Y-axis Presentation:
    • Begin at zero, use evenly spaced increments, leave space above the tallest bar, and state clearly whether the axis shows frequency or relative frequency

Example Bar Charts in Statistical Software

  • StatKey:
    • Titles and labels need to be manually added
  • Rguroo:
    • Titles and editable labels, bars ordered alphabetically or by frequency (Pareto Chart)

Pie Chart

  • Definition:
    • Circle divided into sections; area proportional to relative frequency of each category
  • Usefulness:
    • Depicts relative proportions of categories against the whole

Good Practices when Constructing Pie Charts

  • Generic title:
    • "Distribution of 'Name of the Variable' for 'Number and describe the cases'"
  • Slicing Colors:
    • Distinctive colors for differentiation
  • Labeling:
    • Label each slice with its category name and display a legend

What Do We See? Interpretation

  • Data Interpretation:
    • Analyze graph or chart and numerical summary for distribution insights
    • Emphasis on typical outcomes and variability for categorical variables

Typical Outcome

  • Mode:
    • The most frequently occurring category
    • In Bar Chart: Tallest bar
    • In Pie Chart: Largest slice
  • Types of Modes:
    • Unimodal: One mode
    • Bimodal: Two distinct modes
    • Multimodal: More than two modes
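
For a categorical variable summarized as a frequency table, finding the mode amounts to finding the largest count. The sketch below uses a hypothetical helper named `modes` and makes one assumption worth stating: it treats categories as modes only when they are exactly tied for the highest frequency. Judging whether a chart has two or more distinct peaks of unequal height still requires looking at the graph.

```python
def modes(frequencies: dict) -> list:
    """Return every category tied for the highest frequency."""
    highest = max(frequencies.values())
    return [category for category, count in frequencies.items() if count == highest]


# Frequency table from the cell phone example.
cell_phones = {
    "Android": 458,
    "iPhone": 437,
    "Blackberry": 141,
    "Non-Smartphone": 924,
    "No Cell Phone": 293,
}

print(modes(cell_phones))  # ['Non-Smartphone']
```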

Variability

  • Definition:
    • Diversity of categories within distributions
    • Questions to consider: Why many or few categories? Are they evenly distributed? Is there one significantly larger than the rest?

Interpret the Distribution: Cell Phone Types

  • Observation Findings:
    • Most prevalent category: Non-Smartphone (~41%)
    • Categories are not evenly represented: shares range from about 6% (Blackberry) to about 41% (Non-Smartphone), with Android and iPhone each near 20%

Two Categorical Variable Relationships

  • Two-Way Table:
    • Shows relationship between two categorical variables
    • Rows represent one variable's categories, columns represent another's
    • Each cell denotes count of cases for those categories
    • Include totals for both variables in table margins

Example: Two Categorical Variables

  • Research Context:
    • Relationship between political philosophy and preferred news network
    • Sample: 893 regular TV news watchers

Example: Two-Way Table of Political Leaning vs. News Networks


  • Data Table:

                    Conservative  Moderate  Liberal  Total
    Fox News                 181        50       47    278
    MSNBC                     10        34       88    132
    CNN                       11       125       54    190
    Network News              54        87       67    208
    PBS                       23        20       42     85
    Total                    279       316      298    893
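
The totals in the margins of the table can be recomputed from the interior cell counts. The following is a minimal sketch; storing the table as a dictionary of rows is an assumption made here for illustration.

```python
leanings = ["Conservative", "Moderate", "Liberal"]

# Interior cell counts: one row per news network, one entry per political leaning.
counts = {
    "Fox News":     [181, 50, 47],
    "MSNBC":        [10, 34, 88],
    "CNN":          [11, 125, 54],
    "Network News": [54, 87, 67],
    "PBS":          [23, 20, 42],
}

# Row totals: number of viewers of each network.
row_totals = {network: sum(row) for network, row in counts.items()}

# Column totals: number of viewers with each political leaning.
column_totals = [sum(column) for column in zip(*counts.values())]

grand_total = sum(row_totals.values())

print(row_totals)
print(dict(zip(leanings, column_totals)))
print(grand_total)  # 893
```

The row totals and the column totals must both add up to the same grand total, which is a useful consistency check on any two-way table.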

Example: Proportions Calculation

  • Total watching CNN:
    • Proportion: $\frac{190}{893} = 0.2128$
    • Percentage: 21.28%

Example: Moderates Watching CNN

  • Proportion Calculation:
    • $\frac{125}{316} = 0.3956$
    • Percentage: 39.56%

Example: Moderates Among CNN Watchers

  • Proportion Calculation:
    • $\frac{125}{190} = 0.6579$
    • Percentage: 65.79%

Caution with Interpretation

  • Proportion distinctions matter:
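    • The proportion of CNN watchers who are moderates (125/190) is not the same as the proportion of moderates who watch CNN (125/316); the numerator is the same, but the denominator depends on which group is being described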
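
The three CNN proportions from the preceding examples differ only in which total is used as the denominator. A short sketch, with variable names chosen here for illustration, makes the contrast explicit.

```python
# Counts from the two-way table of political leaning vs. news network.
cnn_and_moderate = 125  # cell count: moderates who prefer CNN
moderate_total = 316    # column total: all moderates
cnn_total = 190         # row total: all CNN watchers
grand_total = 893       # everyone in the sample

p_cnn_overall = cnn_total / grand_total                   # all viewers who watch CNN
p_cnn_among_moderates = cnn_and_moderate / moderate_total  # moderates who watch CNN
p_moderate_among_cnn = cnn_and_moderate / cnn_total        # CNN watchers who are moderate

print(f"{p_cnn_overall:.4f}")          # 0.2128
print(f"{p_cnn_among_moderates:.4f}")  # 0.3956
print(f"{p_moderate_among_cnn:.4f}")   # 0.6579
```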

Example: Compare Proportions of Fox News Watchers

  • Proportions:
    • Conservative: 181/278 = 0.6511
    • Liberal: 47/278 = 0.1691
    • Conclusion: Among Fox News watchers, the proportion of conservatives is greater than the proportion of liberals

Two-Way Table Measurement: Difference

  • Definition:
    • The difference between two sample proportions computed from the table, used to compare two groups
  • Example Calculation:
    • $\hat{p}_c - \hat{p}_l = 0.6511 - 0.1691 = 0.4820$
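
The difference can be reproduced directly from the Fox News row of the two-way table. This is a sketch only; the names `p_hat_c` and `p_hat_l` are assumed here to stand for the conservative and liberal proportions among Fox News watchers.

```python
# Fox News row of the two-way table.
fox_conservative = 181
fox_liberal = 47
fox_total = 278

p_hat_c = fox_conservative / fox_total  # proportion of Fox News watchers who are conservative
p_hat_l = fox_liberal / fox_total       # proportion of Fox News watchers who are liberal

difference = p_hat_c - p_hat_l

print(f"{p_hat_c:.4f}")     # 0.6511
print(f"{p_hat_l:.4f}")     # 0.1691
print(f"{difference:.4f}")  # 0.4820
```

A positive difference means the first proportion is larger; a difference near zero would mean the two groups are about equally represented.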

Two Categorical Variable Graphs

  • Visualization Options:
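    • Segmented Bar Chart:
      One bar per category of the first variable; bar height shows frequency, and each bar is divided into segments by the categories of the second variable
    • Side-by-Side Bar Chart:
      Within each category of the first variable, a separate bar for each category of the second variable, with height showing frequency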

Summary of Topics

  • One Categorical Variable:
    • Summary statistics: Frequency table, Proportion
    • Visualization: Bar chart, Pie chart
  • Two Categorical Variables:
    • Summary statistics: Two-way table, Difference in proportions
    • Visualization: Segmented or Side-by-Side bar chart