Stat 211: Elementary Inferential Statistics - Study Notes
Unit 1: Data and Collection
Identifying Data
Topic: Introduction to Statistics
Definitions: Statistics and Statistic
Statistics (plural): The science of learning from data. It involves collecting, organizing, analyzing, and interpreting information to understand patterns, make decisions, and draw conclusions in the presence of uncertainty.
Statistic (singular): A numerical summary that describes a characteristic of a SAMPLE.
Population vs Sample
Population: The entire group of individuals or cases that we are interested in studying or drawing conclusions about.
Sample: A subset of the population that is actually observed or collected for analysis. It is used to estimate information about the population.
Parameters & Statistics:
Populations have parameters: numerical measurements describing characteristics of a population (usually unknown, denoted by Greek letters).
Samples have statistics: numerical measurements describing characteristics of a sample (calculated to estimate parameters).
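The parameter/statistic distinction can be made concrete with a small simulation. This is only a sketch: the population of exam scores is made up, and the names `mu` and `x_bar` are chosen here for illustration (following the convention that Greek letters denote parameters).

```python
import random
import statistics

rng = random.Random(211)  # fixed seed so the run is repeatable

# Hypothetical population: exam scores for 1,000 students (made-up values).
population = [rng.gauss(70, 10) for _ in range(1000)]

# Parameter: computed from the whole population (usually unknown in practice).
mu = statistics.mean(population)

# Statistic: computed from a sample and used to estimate the parameter.
sample = rng.sample(population, 50)
x_bar = statistics.mean(sample)

print(f"parameter mu    = {mu:.2f}")
print(f"statistic x_bar = {x_bar:.2f}")
```

In a real study only `x_bar` would be available; `mu` can be computed here only because the whole population was simulated.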
Types of Statistics
A statistic is a number that summarizes data from a sample and serves various roles:
Descriptive: Describes what was observed.
Summary: Condenses raw data into a meaningful value.
Explanatory: Helps explore relationships between variables.
Predictive: Used to forecast outcomes.
Statistical Analysis
Descriptive Statistics: Summarizing and organizing data.
Inferential Statistics: Making predictions or decisions about a population based on a sample.
The Process of a Statistical Study
Identify Goals
Draw a Sample from the Population
Summarize Sample Statistics
Draw Conclusions about the Population
Applications of Statistics
Data Collection Examples
Application: Facebook
Data Collection: Gathers user behavior (likes, shares, clicks, time on posts).
Descriptive Statistics: Summarizes activity (average time spent, most liked posts, user demographics).
Inferential Statistics: Makes predictions (What posts will you like next? Who might you know?).
Application: Texting while Driving
Study Details: 40 drivers divided into three groups: sober, drunk, and texting.
Measured: Reaction times in simulated emergencies.
Findings: Reaction times of texting drivers were significantly slower than those of both sober and drunk drivers.
Application: Driving While Black
Study Details: 2,533 traffic stops in Cincinnati, OH.
Focus: Investigated disproportionate stops and searches by race.
Discussion Point: Raises the question: Can these outcomes be explained by random variation alone?
Data Structures
Organization: Tables, matrices, and data frames keep data tidy and accessible.
Efficiency: Enables faster searching, sorting, grouping, and computation.
Interpretation: Makes it easier to summarize, graph, and derive insights.
Analysis-Ready: Most tools require data in specific formats (e.g., rows = observations, columns = variables).
Datasets, Variables, and Cases
Cases: The units or subjects being studied.
Variables: The characteristics or attributes measured on each case.
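A minimal sketch of the "rows = observations, columns = variables" layout, using only built-in Python types. The student records and field names are invented for illustration.

```python
# Each dict is one case (a row); each key is one variable (a column).
# All values below are made up for illustration.
dataset = [
    {"student_id": "S001", "major": "Biology", "credits": 15, "gpa": 3.4},
    {"student_id": "S002", "major": "History", "credits": 12, "gpa": 3.1},
    {"student_id": "S003", "major": "Biology", "credits": 16, "gpa": 3.8},
]

n_cases = len(dataset)               # number of cases (rows)
variables = list(dataset[0].keys())  # names of the variables (columns)

# Pull one variable out across all cases, then summarize it.
gpas = [row["gpa"] for row in dataset]
mean_gpa = sum(gpas) / n_cases
```

Keeping one case per row makes it straightforward to extract a single variable, as `gpas` does, and compute a summary from it.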
Variable Types
Categorical (Qualitative): Places values into groups or categories.
Quantitative: Numerical values where magnitude matters.
Identifier: Unique labels for each case (not used in analysis).
Categorical Variables (Qualitative)
Nominal: No natural order (e.g., types of fruit).
Ordinal: Categories with a meaningful order (e.g., education level).
Quantitative Variables
Discrete: Countable values (e.g., number of students).
Continuous: Any value in an interval (e.g., weight).
Quantitative Measurement Types
Interval: Equal spacing, but no true zero (e.g., temperature in Celsius).
Ratio: Equal spacing with a true zero (e.g., height).
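One way to see why the true zero matters: differences are meaningful on an interval scale, but ratios are not. The sketch below compares two made-up temperatures in Celsius and then on the Kelvin scale, which does have a true zero.

```python
def celsius_to_kelvin(c):
    """Convert Celsius (interval scale) to Kelvin (ratio scale)."""
    return c + 273.15

a_c, b_c = 10.0, 20.0  # two illustrative temperatures in Celsius

# Differences are meaningful on an interval scale: b is 10 degrees warmer.
difference = b_c - a_c

# The Celsius ratio is 2.0, but 20 C is not "twice as hot" as 10 C,
# because 0 C is an arbitrary reference point, not an absence of heat.
naive_ratio = b_c / a_c

# On the Kelvin scale the ratio is meaningful, and it is only about 1.035.
kelvin_ratio = celsius_to_kelvin(b_c) / celsius_to_kelvin(a_c)
```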
Practice: Classify These Variables
Age: Quantitative (Continuous)
Musical Genres: Categorical (Nominal)
Price of Computer: Quantitative (Continuous)
Marital Status: Categorical (Nominal)
Number of Pixels Displayed: Quantitative (Discrete)
Time Needed to Complete Exam: Quantitative (Continuous)
Sampling Techniques
Sampling Overview
Goal: Learn about the entire group of individuals called the population.
Problem: Collecting data on the entire population (called a census) is usually impractical or impossible.
Compromise: Collect data on a smaller group of individuals (called a sample) selected from the population.
Challenge: Obtain a sample that is as representative of the population as possible while avoiding bias.
Bias: The over- or under-emphasizing of some characteristic that is pertinent to the study.
Census
A census is when data is collected for the entire population.
Reasons for Not Conducting a Census:
Difficult to complete.
Hard/expensive to locate everyone.
Impractical in manufacturing.
Populations change over time.
Opinions may change.
The U.S. Census: Conducted every 10 years as required by the U.S. Constitution. Used for congressional representation, federal funding, research, and planning.
Randomization
Definition: Randomizing means letting chance decide who is selected. It protects us from influences on the data, both those we know about and those we don't.
Key Idea: On average, a randomized sample will look like the population.
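The "on average" claim can be checked by simulation. In this sketch the population is hypothetical (30% of 10,000 individuals answer "yes"); the sample size of 100 and the 2,000 repetitions are arbitrary choices.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 individuals, 30% coded 1 ("yes").
population = [1] * 3000 + [0] * 7000
pop_prop = statistics.mean(population)

# Draw many random samples of 100 and record each sample proportion.
sample_props = [
    statistics.mean(rng.sample(population, 100)) for _ in range(2000)
]

# Individual samples vary, but their average sits close to the population value.
avg_sample_prop = statistics.mean(sample_props)
spread = max(sample_props) - min(sample_props)
```

Any single sample proportion may miss 0.30 by several points, which is what `spread` reflects; it is the average over many randomized samples that matches the population.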
Sample Size
Sample Size Definition: The number of individuals in the sample.
Guideline: A few hundred may be enough for a proportion; in general, the bigger the better.
Incomplete Sampling Frame: Not all individuals in the population are included in the list from which the sample is taken.
Sampling Methods
Simple Random Sampling (SRS): Choosing a subset of a population so that every member, and every possible sample of the same size, has an equal chance of being selected.
Example: Dining Hall - Use Excel's =RAND() function to assign each person a random value, sort by that value, and take the first n.
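The spreadsheet recipe translates directly to Python. This is a sketch under assumptions: the sampling frame of 500 ID numbers and the sample size of 20 are made up.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical sampling frame: ID numbers for 500 dining-hall patrons.
frame = list(range(1, 501))
n = 20

# Spreadsheet approach: give each individual a random number,
# sort by that number, and keep the first n.
shuffled = sorted(frame, key=lambda _id: rng.random())
srs_by_sorting = shuffled[:n]

# Standard-library shortcut that draws n members without replacement.
srs_direct = rng.sample(frame, n)
```

Both approaches give every patron the same chance of selection; `random.sample` simply skips the explicit sort.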
Stratified Random Sampling: Dividing the population into homogeneous groups called strata, then drawing a random sample from each stratum, typically in proportion to the stratum's size.
Benefits: More precise estimates, less variability, detect differences among groups.
Simpson's Paradox: A trend appears in separate groups but disappears or reverses when groups are combined.
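A sketch of proportionate stratified sampling, assuming class year is the stratifying variable. The strata sizes and the total sample size of 50 are invented for illustration.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical strata: class year -> list of student IDs (made-up sizes).
strata = {
    "first_year": [f"F{i}" for i in range(400)],
    "sophomore": [f"S{i}" for i in range(300)],
    "junior": [f"J{i}" for i in range(200)],
    "senior": [f"R{i}" for i in range(100)],
}
total = sum(len(members) for members in strata.values())
n = 50  # desired overall sample size

# Take a simple random sample within each stratum, sized in proportion
# to that stratum's share of the population.
stratified_sample = {}
for name, members in strata.items():
    k = round(n * len(members) / total)
    stratified_sample[name] = rng.sample(members, k)

sizes = {name: len(chosen) for name, chosen in stratified_sample.items()}
```

With these sizes the allocation is 20, 15, 10, and 5, so every class year is guaranteed representation. With less convenient numbers, rounding can make the pieces sum to slightly more or less than `n`.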
Cluster Sampling: Divide the population into clusters that resemble one another yet are each internally heterogeneous (each cluster is like a miniature version of the population), then sample whole clusters at random.
Benefits: Less time and cost, natural groupings (e.g., dorms, classes, majors), easier administration (e.g., professors can mandate responses).
Analogy: Stratified Sampling = takes a bite from each layer of a cake; Cluster Sampling = takes a vertical slice through the whole cake.
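Cluster sampling can be sketched the same way, here treating dorms as the natural groupings. The eight dorms of 30 residents each are hypothetical.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical clusters: dorm name -> list of residents.
dorms = {f"Dorm {c}": [f"{c}{i}" for i in range(30)] for c in "ABCDEFGH"}

# Randomly choose whole clusters, then survey everyone inside them.
chosen_dorms = rng.sample(sorted(dorms), 2)
cluster_sample = [person for d in chosen_dorms for person in dorms[d]]
```

The randomness enters when choosing dorms, not individuals, which is the key difference from the stratified approach.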
Systematic Sampling: Choose every nth person after selecting a random starting point (the order must not be related to the outcome).
Example: Select students in a dorm systematically.
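A minimal sketch of systematic selection in a dorm, assuming a hypothetical roster of 200 rooms and an interval of 10.

```python
import random

rng = random.Random(3)  # fixed seed so the run is repeatable

# Hypothetical roster: 200 dorm rooms listed in order.
roster = [f"Room {i}" for i in range(1, 201)]
k = 10  # sampling interval: take every 10th room

# Random starting position within the first interval, then every k-th after it.
start = rng.randrange(k)
systematic_sample = roster[start::k]
```

If the roster order were tied to the outcome (say, every 10th room is a corner suite), this scheme would pick up that pattern, which is why the ordering must be unrelated to what is being measured.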
Multistage Sampling: Sampling schemes that combine several methods are called multistage samples.
Visualization of Sampling Methods
SRS: Randomly selected sample.
Stratified: Grouped by strata attributes.
Systematic: Selected at intervals.
Cluster: Whole groups sampled.
Multistage: Combination of methods.
Practice Quiz: Name that Sampling Method
Pick every 10th passenger on a flight → Systematic
Randomly choose 5 from first class, 25 from coach → Stratified
Randomly generate 30 seat numbers → Simple Random Sampling
Survey everyone sitting in window seats → Cluster Sampling
Bad Sampling Techniques
Voluntary Response Sampling: Individuals choose to participate, prone to bias.
Example: A "Tell Us What You Think" website form.
Convenience Sampling: Drawn from those easiest to reach, prone to bias.
Example: Stopping people outside a dining hall.
Bias in Statistical Studies
What is Bias?
Definition: Bias is the degree to which a procedure systematically over- or under-estimates a population value.
A procedure is unbiased if it produces the true population parameter on average.
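The definition can be illustrated by repeating two sampling procedures many times and comparing their average estimates with the true value. Everything here is an assumption for illustration: the population of weekly gym visits is made up, and "surveying at the gym" is modeled by making people who visit more often proportionally more likely to be picked.

```python
import random
import statistics
from itertools import accumulate

rng = random.Random(5)  # fixed seed so the run is repeatable

# Hypothetical population: weekly gym visits for 5,000 residents.
population = [rng.choice([0, 0, 0, 1, 2, 3, 5]) for _ in range(5000)]
mu = statistics.mean(population)  # true population mean

# Cumulative weights for the gym-door survey: chance of selection grows with visits.
cum_weights = list(accumulate(population))

def srs_estimate(n=50):
    """Sample mean from a simple random sample."""
    return statistics.mean(rng.sample(population, n))

def gym_door_estimate(n=50):
    """Sample mean when frequent visitors are more likely to be surveyed."""
    return statistics.mean(rng.choices(population, cum_weights=cum_weights, k=n))

reps = 1000
# Bias = (average estimate over many repetitions) - (true value).
srs_bias = statistics.mean([srs_estimate() for _ in range(reps)]) - mu
gym_bias = statistics.mean([gym_door_estimate() for _ in range(reps)]) - mu
```

Under these assumptions `srs_bias` lands near zero while `gym_bias` is clearly positive: the gym-door procedure overestimates systematically, and repeating it does not average the error away.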
Major Categories of Bias
Selection Bias: Affects the sample selection process.
Response Bias: Affects the responses provided by respondents.
Types of Selection Bias
Voluntary Response Bias:
Occurs when people choose to participate in a survey, often reflecting only the opinions of those with strong opinions.
Example: A survey asking people to call in their support for a new issue.
Non-Response Bias:
Occurs when selected people are unwilling or unable to participate; those who do not respond may differ in opinions from those who do.
Sampling Bias:
Some individuals are more likely to be selected than others.
Example: Favoring a specific group unintentionally.
Types of Response Bias
Social Acceptability Bias:
People may give answers they think are more socially acceptable.
Example: Over-reporting favorable behaviors like recycling.
Leading Question Bias:
The wording of the question suggests a preferred response.
Example: "Don’t you agree that our policy is beneficial?"
Acquiescence Bias:
The tendency to agree with statements, regardless of true beliefs.
Example: Likely in surveys that use scales (e.g., Strongly Agree to Strongly Disagree).
Self-Interest Bias:
Arises when individuals or organizations have a self-interest in the outcome, which can influence both the study and how the results are analyzed.
Key Takeaway for Types of Bias
| Type of Bias | Description |
|---|---|
| Voluntary Response | Only people with strong opinions tend to participate |
| Non-Response | Those who don't respond may differ meaningfully from responders |
| Sampling | Some individuals or groups are more likely to be selected than others |
| Social Acceptability | People give socially acceptable answers |
| Leading Question | Wording nudges toward a preferred answer |
| Acquiescence | Tendency to agree regardless of true belief |
| Self-Interest | Result influenced by parties with something to gain |
Practice: Identify the Type of Bias
Local business owner asks residents to call a hotline to show support for a new stadium: Type of Bias: Voluntary Response Bias
Police chief sends uniformed officers to ask if residents think the police are doing a bad job: Type of Bias: Response Bias (Social Acceptability Bias)
Candidate's campaign website claims only 11% support a rival's policy: Type of Bias: Self-Interest Bias
Bank mails 8,000 surveys; only 500 are returned: Type of Bias: Non-Response Bias
Fitness center asks, “Why do YOU love our new 24-hour access policy?”: Type of Bias: Leading Question Bias