Statistics Exam Review: Describing Data
Module Overview
- Module 2: Describing Data
- Topic 1: Categorical Variables
- Topic 2: One Quantitative Variable: Shape and Center
- Topic 3: One Quantitative Variable: Spread
- Topic 4: Boxplots and Quantitative/Categorical Relationships
- Covers Chapter 2 (Sections 1 – 4) in Lock5 Text
Review: Order of Operations
- Method of Attack:
- Ask a research question →
- Design a study and collect data →
- Organize data →
- Visually explore data for patterns →
- Summarize data numerically →
- Draw inferences and formulate conclusions →
- Look back and ahead
Review: Data
- Definition:
- Observations gathered for analysis
- Dataset:
- Consists of values for one or more variables measuring information for each case
- Cases:
- Subjects/objects of interest, also known as units or individuals
- Variables:
- Characteristics recorded for each case, corresponding to the columns in a data table
Review: Variable Types
- Categorical Variable:
- Divides cases into groups, placing each case into exactly one of two or more categories
- Quantitative Variable:
- Measures or records a numerical quantity for each case
Review: Populations and Samples
- Population:
- All individuals or objects of interest, often the group being studied
- Sample:
- A subset of the population selected to provide information about it
- Used because accessing the entire population is often impractical
Visual Depiction with Descriptive Statistics
- Sampling:
- N: Population size (often unknown)
- n: Sample size
- Statistical Inference:
- Uses information from the sample to draw conclusions about the population
- Descriptive Statistics:
- Use numerical and graphical methods to identify patterns, summarize information, and present data meaningfully
Next Steps
- After Data Collection:
- Present the data using graphical methods
- Summarize the data using numerical methods
- Visualization and summary statistics depend on the variable type(s) (categorical or quantitative)
- Known as descriptive statistics or exploratory data analysis
Data Distribution
- Definition:
- The distribution of a variable describes which values it takes and how often it takes each value
- Visualization Importance:
- Enables analysis of data variation for both categorical and quantitative variables
Example: One Categorical Variable - Cell Phones
- Survey Context:
- A random sample of US adults surveyed in 2012 regarding cell phone ownership
- Response Categories:
- Android, iPhone, Blackberry, Non-smartphone, No cell phone
Example: Cell Phone Type - Results
- Frequency Table:
- Shows the number of cases in each category
- Sample size collected: n = 2,253
- Type of Cell Phone Frequency Data:
- Android: 458
- iPhone: 437
- Blackberry: 141
- Non-Smartphone: 924
- No Cell Phone: 293
- Total: 2,253
From Frequency to Relative Frequency
- Definitions:
- Frequency: Number of times a value occurs in a dataset
- Relative Frequency: Frequency divided by total observations (n), also a proportion between 0 and 1
- Percentage: Relative frequency multiplied by 100
- Example Calculation:
- Frequency of 3 with total observations of 6
- Relative frequency: $\frac{3}{6} = 0.5$
- Percentage: 50%
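The three definitions above can be sketched in a few lines of Python. This is only an illustration: the `summarize` helper and the six `answers` are hypothetical, chosen so that one value occurs 3 times out of 6 as in the example calculation.

```python
from collections import Counter

def summarize(values):
    """Return {value: (frequency, relative frequency, percentage)}."""
    n = len(values)                 # total number of observations
    counts = Counter(values)        # frequency of each distinct value
    return {v: (c, c / n, 100 * c / n) for v, c in counts.items()}

# Hypothetical data: 6 observations, "yes" occurs 3 times
answers = ["yes", "no", "yes", "maybe", "yes", "no"]
summary = summarize(answers)

freq, rel_freq, pct = summary["yes"]
print(freq, rel_freq, pct)  # 3 0.5 50.0
```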
Measurement: Proportion
- Calculation:
- $\text{proportion} = \frac{\text{number in category}}{\text{total number of observations}}$
- Sample Proportion:
- Denoted as $\hat{p}$ when computed from a sample
- Population Proportion:
- Denoted as p when from a population
Example: Finding a Proportion of Cell Phone Non-ownership
- Calculation:
- Proportion of adults without a cell phone:
- $\hat{p} = \frac{293}{2253} \approx 0.13$
- Percentage: 13%
Example: Relative Frequency Table
- Description:
- Displays proportion of cases in each category
- Sum of relative frequencies equals 1
- Relative Frequency Data:
- Android: 0.203
- iPhone: 0.194
- Blackberry: 0.063
- Non-Smartphone: 0.410
- No Cell Phone: 0.130
- Total: 1
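As a quick check, the relative frequency table can be rebuilt from the frequency counts in the cell phone example. The sketch below uses only the counts given in these notes; the variable names (`phone_counts`, `relative`) are illustrative.

```python
# Frequency table from the 2012 cell phone survey example
phone_counts = {
    "Android": 458,
    "iPhone": 437,
    "Blackberry": 141,
    "Non-Smartphone": 924,
    "No Cell Phone": 293,
}

n = sum(phone_counts.values())  # sample size

# Relative frequency = frequency / n
relative = {category: count / n for category, count in phone_counts.items()}

for category, p in relative.items():
    print(f"{category:<15}{p:.3f}")
print(f"{'Total':<15}{sum(relative.values()):.3f}")
```

Each printed value should match the table above to three decimal places, and the total should be 1.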
Graphing One Categorical Variable
- Bar Chart:
- Categories represented by bars, height corresponds to frequency, relative frequency, or percentage
- X-axis: Categories; Y-axis: Frequency, relative frequency, or percentage
Bar Chart Titling
- Generic title:
- "Distribution of 'Name of the Variable' for 'Number and describe the cases'"
- X-axis Label:
- Name of the Variable (e.g., "Cell phone Type")
- Y-axis Label:
- Number of 'Describe the cases' or Proportion
Good Practices when Constructing Bar Charts
- Bar Chart Structure:
- Equally spaced categories on the x-axis
- Bars of uniform width
- Meaningful labeling
- Y-axis Presentation:
- Begin at zero, use evenly spaced increments, leave space above the tallest bar, and state clearly whether the axis shows frequency or relative frequency
Example Bar Charts in Statistical Software
- StatKey:
- Titles and labels need to be manually added
- Rguroo:
- Titles and editable labels, bars ordered alphabetically or by frequency (Pareto Chart)
Pie Chart
- Definition:
- Circle divided into sections; area proportional to relative frequency of each category
- Usefulness:
- Depicts relative proportions of categories against the whole
Good Practices when Constructing Pie Charts
- Generic title:
- "Distribution of 'Name of the Variable' for 'Number and describe the cases'"
- Slice Colors:
- Distinctive colors for differentiation
- Labeling:
- Label each slice with its category name and include a legend
What Do We See? Interpretation
- Data Interpretation:
- Analyze graph or chart and numerical summary for distribution insights
- Emphasis on typical outcomes and variability for categorical variables
Typical Outcome
- Mode:
- The most frequently occurring category
- In Bar Chart: Tallest bar
- In Pie Chart: Largest slice
- Types of Modes:
- Unimodal: One mode
- Bimodal: Two distinct modes
- Multimodal: More than two modes
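A minimal sketch of finding the mode of a categorical variable from its frequency table, using the cell phone counts from earlier in these notes. The helper names `modes` and `describe_modality` are hypothetical, and the sketch assumes a simplification: only exact ties for the highest frequency count as multiple modes.

```python
def modes(counts):
    """Return every category tied for the highest frequency."""
    top = max(counts.values())
    return [category for category, c in counts.items() if c == top]

def describe_modality(counts):
    """Label the distribution by its number of modes (exact ties only)."""
    k = len(modes(counts))
    if k == 1:
        return "unimodal"
    if k == 2:
        return "bimodal"
    return "multimodal"

phone_counts = {
    "Android": 458,
    "iPhone": 437,
    "Blackberry": 141,
    "Non-Smartphone": 924,
    "No Cell Phone": 293,
}

print(modes(phone_counts))              # ['Non-Smartphone']
print(describe_modality(phone_counts))  # unimodal
```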
Variability
- Definition:
- Diversity of categories within distributions
- Questions to consider: Why many or few categories? Are they evenly distributed? Is there one significantly larger than the rest?
Interpret the Distribution: Cell Phone Types
- Observation Findings:
- Most prevalent category: Non-Smartphone (~41%)
- The categories are not evenly distributed: Non-Smartphone is about twice as common as Android (~20%) or iPhone (~19%), and Blackberry is the least common (~6%)
Two Categorical Variable Relationships
- Two-Way Table:
- Shows relationship between two categorical variables
- Rows represent one variable's categories, columns represent another's
- Each cell denotes count of cases for those categories
- Include totals for both variables in table margins
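One way a two-way table with margin totals could be tallied from raw cases is sketched below. The six cases and the category names (`Group A`, `Group B`, `Yes`, `No`) are made up purely for illustration and do not come from any dataset in these notes.

```python
from collections import Counter

# Hypothetical raw data: one (row category, column category) pair per case
cases = [
    ("Group A", "Yes"), ("Group A", "No"), ("Group B", "Yes"),
    ("Group A", "Yes"), ("Group B", "No"), ("Group B", "No"),
]

cells = Counter(cases)                     # count of cases in each cell
row_totals = Counter(r for r, _ in cases)  # right-hand margin
col_totals = Counter(c for _, c in cases)  # bottom margin
grand_total = len(cases)

rows = sorted(row_totals)
cols = sorted(col_totals)

# Print the table with totals in the margins
print(f"{'':<10}" + "".join(f"{c:>6}" for c in cols) + f"{'Total':>7}")
for r in rows:
    counts = "".join(f"{cells[(r, c)]:>6}" for c in cols)
    print(f"{r:<10}{counts}{row_totals[r]:>7}")
print(f"{'Total':<10}" + "".join(f"{col_totals[c]:>6}" for c in cols) + f"{grand_total:>7}")
```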
Example: Two Categorical Variables
- Research Context:
- Relationship between political philosophy and preferred news network
- Sample: 893 regular TV news watchers
Example: Two-Way Table of Political Leaning vs. News Networks
| | Conservative | Moderate | Liberal | Total |
|---|---|---|---|---|
| Fox News | 181 | 50 | 47 | 278 |
| MSNBC | 10 | 34 | 88 | 132 |
| CNN | 11 | 125 | 54 | 190 |
| Network News | 54 | 87 | 67 | 208 |
| PBS | 23 | 20 | 42 | 85 |
| Total | 279 | 316 | 298 | 893 |

Example: Proportions Calculation
- Proportion of all respondents who watch CNN:
- Proportion: $\frac{190}{893} = 0.2128$
- Percentage: 21.28%
Example: Moderates Watching CNN
- Proportion Calculation:
- $\frac{125}{316} = 0.3956$
- Percentage: 39.56%
Example: Moderates Among CNN Watchers
- Proportion Calculation:
- $\frac{125}{190} = 0.6579$
- Percentage: 65.79%
Caution with Interpretation
- Proportion distinctions matter:
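- The proportion of CNN watchers who are moderates (125/190) is not the same as the proportion of moderates who watch CNN (125/316): the numerator is the same, but the denominator changes with the group being described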
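The three CNN proportions differ only in their denominators, which a short Python sketch can make explicit. The counts are those of the news network example; the nested-dictionary layout and the variable names are just one possible choice.

```python
# Cell counts from the political leaning vs. news network example
table = {
    "Fox News":     {"Conservative": 181, "Moderate": 50,  "Liberal": 47},
    "MSNBC":        {"Conservative": 10,  "Moderate": 34,  "Liberal": 88},
    "CNN":          {"Conservative": 11,  "Moderate": 125, "Liberal": 54},
    "Network News": {"Conservative": 54,  "Moderate": 87,  "Liberal": 67},
    "PBS":          {"Conservative": 23,  "Moderate": 20,  "Liberal": 42},
}

grand_total = sum(sum(row.values()) for row in table.values())   # all cases
cnn_total = sum(table["CNN"].values())                            # row total
moderate_total = sum(row["Moderate"] for row in table.values())   # column total
cnn_moderates = table["CNN"]["Moderate"]                          # single cell

p_cnn = cnn_total / grand_total                  # all respondents who watch CNN
p_cnn_among_moderates = cnn_moderates / moderate_total
p_moderates_among_cnn = cnn_moderates / cnn_total

print(f"{p_cnn:.4f}")                  # 0.2128
print(f"{p_cnn_among_moderates:.4f}")  # 0.3956
print(f"{p_moderates_among_cnn:.4f}")  # 0.6579
```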
Example: Compare Proportions of Fox News Watchers
- Proportions:
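- Fox News watchers who are conservative: 181/278 = 0.6511
- Fox News watchers who are liberal: 47/278 = 0.1691
- Conclusion: Among Fox News watchers, the proportion who are conservative is greater than the proportion who are liberal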
Two-Way Table Measurement: Difference
- Definition:
- The difference between two proportions computed from a two-way table, used to compare two categories or groups
- Example Calculation:
- $\hat{p}_c - \hat{p}_l = 0.6511 - 0.1691 = 0.4820$
Two Categorical Variable Graphs
- Visualization Options:
- Segmented Bar Chart:
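- One bar for each category of one variable, with bar height showing frequency; each bar is divided into segments for the categories of the other variable
- Side-by-Side Bar Chart:
- A separate group of bars for each category of one variable, with one bar per category of the other variable showing its frequency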
Summary of Topics
- One Categorical Variable:
- Summary statistics: Frequency table, Proportion
- Visualization: Bar chart, Pie chart
- Two Categorical Variables:
- Summary statistics: Two-way table, Difference in proportions
- Visualization: Segmented or Side-by-Side bar chart