Statistics is the science of conducting studies to collect, organize, summarize, analyze, and draw conclusions from data.
Two basic branches:
Descriptive statistics: collection, organization, summarization, and presentation of data.
Inferential statistics: generalizing from samples to populations; estimating parameters, testing hypotheses, determining relationships among variables, and making predictions.
Descriptive and Inferential Statistics
Descriptive statistics describe data from a sample or population.
Inferential statistics use sample data to make inferences about a population.
Basic Vocabulary
Population: all subjects being studied; depends on the research question; can be people, objects, or numerical values.
Sample: group of subjects selected from a population; its key attribute is how well it represents the population.
Parameter: numerical measure that describes an aspect of the population.
Statistic: numerical measure that describes an aspect of a sample.
Notes: a parameter is fixed for a given population, but sample statistics vary from sample to sample. A key goal of inferential statistics is to use sample statistics to reliably estimate a parameter.
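The note above can be illustrated with a short simulation. This is only a sketch: the population of 100,000 voters and its 45% approval rate are made-up assumptions, and `sample_proportion` is a helper name introduced here for illustration.

```python
import random

# Hypothetical population: 100,000 voters, 1 = approves, 0 = does not.
# The 45% approval rate is an illustrative assumption, not real polling data.
rng = random.Random(0)
population = [1] * 45_000 + [0] * 55_000

# Parameter: computed from the whole population, so it is one fixed number.
parameter = sum(population) / len(population)

def sample_proportion(pop, n, rng):
    """Draw a simple random sample of size n and return the proportion of 1s."""
    sample = rng.sample(pop, n)
    return sum(sample) / n

# Statistic: recomputed for each new sample, so it changes from draw to draw.
statistics = [sample_proportion(population, 1000, rng) for _ in range(5)]

print(f"parameter: {parameter:.3f}")
for i, p_hat in enumerate(statistics, start=1):
    print(f"sample {i}: p_hat = {p_hat:.3f}")
```

Each run of `sample_proportion` should land near the parameter without matching it exactly, which is why inference has to account for sample-to-sample variation.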
Parameter vs. Statistic (Example)
Example: measure the approval rating for the President of the U.S.
Population parameter: the percentage of voters in the entire U.S. population who approve of the president’s performance.
If 1000 voters are surveyed and 450 approve, the sample proportion is $\hat{p} = \frac{450}{1000} = 0.45$, which is a statistic.
Population, Sample, Parameter, and Statistic (Insurance Example)
Example: insurer wants the proportion of all medical doctors with at least one malpractice lawsuit.
Population: all medical doctors listed in the professional directory.
Parameter of interest: proportion of medical doctors in the population with at least one malpractice suit.
Sample: 500 doctors selected from the directory.
Statistic: proportion of the 500 doctors in the sample with at least one malpractice suit.
Continued breakdown:
The population is the set of all medical doctors in the directory.
The parameter is the population proportion with a malpractice suit.
The sample is the 500 doctors chosen.
The statistic is the sample proportion with a malpractice suit.
Variables and Data
A variable is a characteristic or attribute that can assume different values.
Data consist of observed values of variables.
A data set is a collection of data values.
Data values are often called observations.
Random variables are variables whose values are determined by chance.
Population vs. Sample Data
Population data come from EVERY individual of interest.
Sample data come from only some individuals from the population.
Types of Variables and Data
Categorical (qualitative) variables: place individuals into distinct categories based on a characteristic.
Numerical (quantitative) variables: numerical in nature and can be ordered or ranked.
Quantitative variables can be further classified:
Discrete: assume values that can be counted.
Continuous: can assume all values between any two specific values.
Examples: Classify Data Type
The number of pairs of shoes owned: quantitative, discrete.
Political party affiliation: categorical.
Distance from home to the nearest grocery store: quantitative, continuous.
Eye color: categorical.
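The classification rules above can be sketched as a rough heuristic in code. The function name `classify` and the sample observations are assumptions made up for illustration. Real data still needs human judgment: ZIP codes, for example, are numeric but categorical, and a distance that happens to be recorded as a whole number would be mislabeled as discrete.

```python
from numbers import Real

def classify(values):
    """Crude heuristic: label a list of observations by variable type.

    Assumption: non-numeric values are categorical, whole-number values are
    counts (discrete), and any other numeric values are measurements
    (continuous).
    """
    if not all(isinstance(v, Real) and not isinstance(v, bool) for v in values):
        return "categorical"
    if all(float(v).is_integer() for v in values):
        return "quantitative, discrete"
    return "quantitative, continuous"

# Hypothetical observations for the four examples in the notes.
observations = {
    "pairs of shoes owned": [3, 12, 7, 5],
    "political party affiliation": ["Democrat", "Republican", "Independent"],
    "distance to grocery store (miles)": [0.8, 2.35, 1.1],
    "eye color": ["brown", "blue", "green"],
}
for name, values in observations.items():
    print(f"{name}: {classify(values)}")
```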
Data Collection Methods
Four common ways to collect data:
From a published source (e.g., government agency, academic journal)
Surveys
Observational studies
Experiments
All involve sampling.
The key to reliable data collection is avoiding bias.
Sampling and Data Collection Bias
A representative sample is crucial; non-representative samples introduce bias.
Sampling bias occurs when some population members are significantly less likely to be chosen than others.
Randomness in selection helps ensure representativeness.
Simple Random Sampling and Other Random Samples
Simple random sample: every set of n individuals in the population has the same probability of being chosen.
Other random sampling methods exist:
Systematic sampling: number the subjects, choose a random starting point, then select every kth subject.
Stratified sampling: divide the population into groups (strata) and sample from each group.
Cluster sampling: divide into clusters, randomly select clusters, and include all members of selected clusters.
Convenience sampling (not a random method): choose the most readily available members; typically not representative and prone to bias.
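The four random methods listed above can be sketched in a few lines each. This is a minimal illustration, not a production sampler: the function names, the population of subjects numbered 1 to 100, and the `strata` and `clusters` groupings are all assumptions introduced here.

```python
import random

def simple_random_sample(population, n, rng):
    """Every set of n individuals is equally likely to be chosen."""
    return rng.sample(population, n)

def systematic_sample(population, k, rng):
    """Pick a random start among the first k subjects, then take every kth."""
    start = rng.randrange(k)
    return population[start::k]

def stratified_sample(strata, n_per_stratum, rng):
    """Draw a simple random sample from each stratum (group)."""
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Randomly select whole clusters and keep every member of each."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

rng = random.Random(42)
population = list(range(1, 101))  # subjects numbered 1..100

# Hypothetical groupings, for illustration only.
strata = {"low": population[:50], "high": population[50:]}
clusters = {f"c{i}": population[i * 10:(i + 1) * 10] for i in range(10)}

srs = simple_random_sample(population, 10, rng)
sys_sample = systematic_sample(population, 10, rng)
strat = stratified_sample(strata, 5, rng)
clus = cluster_sample(clusters, 2, rng)

print("simple random:", sorted(srs))
print("systematic:   ", sys_sample)
print("stratified:   ", sorted(strat))
print("cluster:      ", clus)
```

Note the difference between the last two: stratified sampling takes some members from every group, while cluster sampling takes every member from some groups.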
Sampling Methods (Examples and Identification)
Examples to identify sampling types:
Systematic: picking every 1000th income tax return.
Simple Random: using random numbers to select voters to interview.
Convenience: inspecting the first 100 items produced in a day.
Cluster: randomly selecting whole schools and interviewing all teachers in those schools.
Stratified: sampling from each demographic group (e.g., race) in proportion to its share of the population.
Summary of types: Systematic, Simple Random, Convenience, Cluster, Stratified.
Surveys and Bias
Surveys collect data for business and social sciences; data can be qualitative or quantitative.
Any sampling technique can be used in a survey.
Core concern: bias, a systematic tendency toward a particular type of response.
Potential Sources of Bias in Surveys
Non-response bias: some people cannot be contacted or refuse to answer; those who do respond often have strong opinions.
Sensitive subject matter: can cause non-response or dishonesty.
Faulty recall: respondents may not accurately remember events.
Loaded/misleading questions: wording seeks a specific response.
Vague wording: terms like “often”, “seldom”, “usually” interpreted differently.
Interviewer influence: tone/behavior or interview setting can affect answers.
Self-selected (voluntary response): respondents opt in via ads or TV.
Example: Capital Punishment Poll Bias
Local TV show asks viewers to call a toll-free number to express opinions on capital punishment.
Callers are self-selected, and the show’s format and lead-in stories can influence their responses, so the results are likely to be biased.
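A small simulation can show how a call-in poll of this kind drifts away from the population. Every number below is a made-up assumption chosen to make the effect visible: opinion is split 50/50, one side is assumed to hold its view more strongly, and people with strong opinions are assumed to be far more likely to call.

```python
import random

rng = random.Random(1)

# Hypothetical population of 10,000 viewers. Each viewer has an opinion
# (True = in favor) and a flag for whether the opinion is strongly held.
population = []
for _ in range(10_000):
    in_favor = rng.random() < 0.50
    # Assumption: those in favor feel strongly more often than those opposed.
    strong = rng.random() < (0.40 if in_favor else 0.10)
    population.append((in_favor, strong))

def proportion_in_favor(people):
    """Share of the given people whose opinion is 'in favor'."""
    return sum(in_favor for in_favor, _ in people) / len(people)

# Call-in poll: assume viewers with strong opinions call 30% of the time
# and everyone else calls 2% of the time.
callers = [p for p in population if rng.random() < (0.30 if p[1] else 0.02)]

# Simple random sample of the same size, for comparison.
srs = rng.sample(population, len(callers))

print(f"population:    {proportion_in_favor(population):.2f}")
print(f"call-in poll:  {proportion_in_favor(callers):.2f}")
print(f"random sample: {proportion_in_favor(srs):.2f}")
```

Under these assumptions the call-in figure overstates support by a wide margin, while a random sample of the same size stays close to the population value.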
Observational vs Experimental Studies
Observational study: researcher observes what happened without manipulating variables.
Experimental study: researcher manipulates one variable to see its effect on others.
Case-Control Studies (Observational)
Case-control study: observational study with two or more groups.
Cases: participants who engage in the behavior under study.
Controls: participants who do not engage in the behavior.
Allows comparison between cases and controls, but other factors (lurking variables) may affect results.
Lurking variable: a variable not measured that still influences other variables.
Experimental Design and Confounding
Two variables are confounded if their effects cannot be distinguished.
Advantage of experiments: better control of variables, reducing confounding.
Randomization is a key component of experimental design.
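To see why randomization helps, the sketch below compares self-selected groups with randomly assigned ones. It is an illustration under stated assumptions: age stands in for a lurking variable, the subjects are simulated, and older subjects are assumed to be more likely to take up the behavior on their own.

```python
import random
from statistics import mean

rng = random.Random(7)

# Hypothetical subjects; age plays the role of a lurking variable.
ages = [rng.randint(20, 80) for _ in range(2000)]

# Observational setting: assume the chance of engaging in the behavior
# rises with age, so cases and controls differ in age from the start.
cases, controls = [], []
for age in ages:
    if rng.random() < (age - 20) / 60:
        cases.append(age)
    else:
        controls.append(age)

# Experimental setting: shuffle, then split in half, so chance alone
# decides who receives the treatment.
shuffled = ages[:]
rng.shuffle(shuffled)
treatment, control = shuffled[:1000], shuffled[1000:]

obs_gap = mean(cases) - mean(controls)
exp_gap = mean(treatment) - mean(control)

print(f"age gap, self-selected groups:     {obs_gap:.1f} years")
print(f"age gap, randomly assigned groups: {exp_gap:.1f} years")
```

In the self-selected groups, any difference in outcomes is confounded with age. After random assignment the groups have nearly the same age profile, so age can no longer explain a difference between them.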
Treatment vs Control Groups and Placebo
Treatment group: receives the treatment being tested.
Control group: does not receive the treatment.
Subjects should be assigned to the two groups at random, so that the groups are alike in all respects except for the treatment.
Placebo effect: participants improve due to belief in treatment, even if active ingredients are absent.
A placebo is indistinguishable from the real treatment; participants cannot tell the difference.
Blinding in Experiments
Single-blind: participants do not know if they are in treatment or control; researchers may know.
Double-blind: neither participants nor researchers know who is in treatment or control.
Example: Aspirin and Heart Attacks (Single Blind, Randomized Controlled Experiment)
Population: all men aged 50 to 84.
Sample: 400 men in the study.
Treatments: aspirin vs placebo.
Explanatory variable: type of oral medication (aspirin or placebo).
Response variable: whether the individual had a heart attack.