Chapter 1 Notes

Introduction to Statistics

  • Statistics is the science of conducting studies to collect, organize, summarize, analyze, and draw conclusions from data.
  • Two basic branches:
    • Descriptive statistics: collection, organization, summarization, and presentation of data.
    • Inferential statistics: generalizing from samples to populations; estimating parameters, testing hypotheses, determining relationships among variables, and making predictions.

Descriptive and Inferential Statistics

  • Descriptive statistics describe data from a sample or population.
  • Inferential statistics use sample data to make inferences about a population.

Basic Vocabulary

  • Population: all subjects being studied; depends on the research question; can be people, objects, or numerical values.
  • Sample: group of subjects selected from a population; its most important attribute is how well it represents the population.
  • Parameter: numerical measure that describes an aspect of the population.
  • Statistic: numerical measure that describes an aspect of a sample.
  • Note: a parameter is fixed for a given population, but sample statistics vary from sample to sample. A key goal of inferential statistics is to use sample statistics to reliably estimate a parameter.

Parameter vs. Statistic (Example)

  • Example: measure the approval rating for the President of the U.S.
    • Population parameter: the percentage of voters in the entire U.S. population who approve of the president’s performance.
    • If 1000 voters are surveyed and 450 approve, the sample proportion is $\hat{p} = \frac{450}{1000} = 0.45$, which is a statistic.
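
The sample proportion in this example can be checked in a few lines of Python. The sketch below also simulates repeated surveys to illustrate that a statistic changes from sample to sample while the parameter stays fixed. The "true" approval rate of 0.47, the seed, and the helper name `survey` are made-up values for illustration, not figures from these notes.

```python
import random

# Sample proportion from the example: 450 of 1000 surveyed voters approve.
approvals, n = 450, 1000
p_hat = approvals / n  # statistic: describes the sample only

# Hypothetical population parameter (an assumption for illustration).
TRUE_P = 0.47

def survey(n, true_p, rng):
    """Simulate surveying n voters; return the sample proportion who approve."""
    return sum(rng.random() < true_p for _ in range(n)) / n

rng = random.Random(1)  # fixed seed so the run is repeatable
estimates = [survey(1000, TRUE_P, rng) for _ in range(5)]

print(f"p_hat from the example: {p_hat}")
print("five simulated sample proportions:", estimates)
```

Each simulated survey gives a slightly different proportion, all scattered around the assumed parameter.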

Population, Sample, Parameter, and Statistic (Insurance Example)

  • Example: insurer wants the proportion of all medical doctors with at least one malpractice lawsuit.
    • Population: all medical doctors listed in the professional directory.
    • Parameter of interest: proportion of medical doctors in the population with at least one malpractice suit.
    • Sample: 500 doctors selected from the directory.
    • Statistic: proportion of the 500 doctors in the sample with at least one malpractice suit.

Variables and Data

  • A variable is a characteristic or attribute that can assume different values.
  • Data consist of observed values of variables.
  • A data set is a collection of data values.
  • Data values are often called observations.
  • Random variables are variables whose values are determined by chance.

Population vs. Sample Data

  • Population data come from EVERY individual of interest.
  • Sample data come from only some individuals from the population.

Types of Variables and Data

  • Categorical (qualitative) variables: place individuals into distinct categories based on a characteristic.
  • Numerical (quantitative) variables: numerical in nature and can be ordered or ranked.
  • Quantitative variables can be further classified:
    • Discrete: assume values that can be counted.
    • Continuous: can assume all values between any two specific values.

Examples: Classify Data Type

  • The number of pairs of shoes owned: quantitative, discrete.
  • Political party affiliation: categorical.
  • Distance from home to the nearest grocery store: quantitative, continuous.
  • Eye color: categorical.
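
As a rough illustration, the four examples above can be mapped onto Python types: counts as `int`, measurements as `float`, and categories as `str`. The record layout `Respondent`, the helper `kind`, and the sample values are hypothetical, and classifying by storage type is only a heuristic.

```python
from dataclasses import dataclass

@dataclass
class Respondent:             # hypothetical record layout
    pairs_of_shoes: int       # quantitative, discrete (a count)
    party: str                # categorical
    miles_to_grocery: float   # quantitative, continuous (a measurement)
    eye_color: str            # categorical

def kind(value):
    """Rough heuristic: classify a value by its Python type."""
    if isinstance(value, (bool, str)):
        return "categorical"
    if isinstance(value, int):
        return "quantitative, discrete"
    if isinstance(value, float):
        return "quantitative, continuous"
    raise TypeError(f"unsupported type: {type(value).__name__}")

r = Respondent(12, "Independent", 2.4, "brown")  # made-up values
labels = {name: kind(v) for name, v in vars(r).items()}
print(labels)
```

The heuristic breaks down for numeric codes: a ZIP code or jersey number is stored as a number but is still categorical, because arithmetic on it is meaningless.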

Data Collection Methods

  • Four common ways to collect data:
    • From a published source (e.g., government agency, academic journal)
    • Surveys
    • Observational studies
    • Experiments
  • All involve sampling.
  • The key to reliable data collection is avoiding bias.

Sampling and Data Collection Bias

  • A representative sample is crucial; non-representative samples introduce bias.
  • Sampling bias occurs when some members of the population are systematically less likely to be chosen than others.
  • Randomness in selection helps ensure representativeness.

Simple Random Sampling and Other Random Samples

  • Simple random sample: every set of n individuals in the population has the same probability of being chosen.
  • Other random sampling methods exist:
    • Systematic sampling: number the subjects, pick a random starting point, then select every kth subject.
    • Stratified sampling: divide the population into groups (strata) and randomly sample from each group.
    • Cluster sampling: divide into clusters, randomly select clusters, and include all members of selected clusters.
  • Convenience sampling: choose the most readily available members; typically not representative and prone to bias.
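
The four random sampling methods above can be sketched with Python's standard `random` module. This is a minimal illustration, not a survey-grade implementation: the population of 100 numbered subjects, the grouping into blocks of ten, and all function names are assumptions made for the example.

```python
import random

rng = random.Random(42)               # fixed seed for repeatability
population = list(range(1, 101))      # hypothetical subjects numbered 1..100

def simple_random(pop, n):
    """Simple random sample: every set of n subjects is equally likely."""
    return rng.sample(pop, n)

def systematic(pop, k):
    """Systematic sample: random start among the first k, then every kth."""
    start = rng.randrange(k)
    return pop[start::k]

def stratified(strata, n_per_stratum):
    """Stratified sample: a simple random sample from every stratum."""
    return [s for group in strata for s in rng.sample(group, n_per_stratum)]

def cluster(clusters, n_clusters):
    """Cluster sample: pick whole clusters at random, keep all members."""
    chosen = rng.sample(clusters, n_clusters)
    return [s for group in chosen for s in group]

# Hypothetical grouping: ten consecutive blocks of ten subjects each,
# used here both as strata and as clusters purely for illustration.
groups = [population[i:i + 10] for i in range(0, 100, 10)]

srs = simple_random(population, 10)
sys_sample = systematic(population, 10)
strat = stratified(groups, 1)
clus = cluster(groups, 2)
```

Note the contrast between the last two: the stratified sample contains one subject from every group, while the cluster sample contains every subject from only two groups.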

Sampling Methods (Examples and Identification)

  • Examples to identify sampling types:
    • Systematic: picking every 1000th income tax return.
    • Simple Random: using random numbers to select voters to interview.
    • Convenience: inspecting the first 100 items produced in a day.
    • Cluster: randomly selecting whole schools and interviewing all teachers in those schools.
    • Stratified: dividing the population into demographic groups (e.g., by race) and sampling from each group in proportion to its share of the population.
  • Summary of types: Systematic, Simple Random, Convenience, Cluster, Stratified.

Surveys and Bias

  • Surveys collect data for business and social sciences; data can be qualitative or quantitative.
  • Any sampling technique can be used in a survey.
  • Core concern: bias, a systematic tendency toward a particular type of response.

Potential Sources of Bias in Surveys

  • Non-response bias: some selected individuals cannot be contacted or refuse to respond; those who do respond often hold stronger opinions than those who do not.
  • Sensitive subject matter: can cause non-response or dishonesty.
  • Faulty recall: respondents may not accurately remember events.
  • Loaded/misleading questions: wording seeks a specific response.
  • Vague wording: terms like “often”, “seldom”, “usually” interpreted differently.
  • Interviewer influence: tone/behavior or interview setting can affect answers.
  • Self-selected (voluntary response): respondents choose to take part (e.g., by answering an ad or a TV call-in), so people with strong opinions are overrepresented.

Example: Capital Punishment Poll Bias

  • Local TV show asks viewers to call a toll-free number to express opinions on capital punishment.
  • The sample is self-selected (voluntary response), and the show’s format and lead-in stories can influence who calls and how they respond.
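
A small simulation can show how a call-in poll drifts away from the population value. Everything numeric here is invented for illustration: an audience split roughly evenly on the issue, and opponents assumed to be five times as likely to phone in as supporters.

```python
import random

rng = random.Random(7)

# Hypothetical audience: True = supports, False = opposes (about half each).
viewers = [rng.random() < 0.5 for _ in range(100_000)]
true_support = sum(viewers) / len(viewers)   # the parameter

# Assumed call-in behaviour: opponents are five times as likely to phone in.
CALL_PROB = {True: 0.02, False: 0.10}
callers = [v for v in viewers if rng.random() < CALL_PROB[v]]
call_in_support = sum(callers) / len(callers)

# A simple random sample of the same audience, for comparison.
srs = rng.sample(viewers, 1000)
srs_support = sum(srs) / len(srs)

print(f"true support:      {true_support:.3f}")
print(f"call-in estimate:  {call_in_support:.3f}")
print(f"random-sample est: {srs_support:.3f}")
```

Under these assumptions the call-in estimate lands far below the true value even though several thousand people respond, while the much smaller random sample stays close to it. A large sample does not repair a biased selection process.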

Observational vs Experimental Studies

  • Observational study: researcher observes what happened without manipulating variables.
  • Experimental study: researcher manipulates one variable to see its effect on others.

Case-Control Studies (Observational)

  • Case-control study: observational study that compares two or more groups.
    • Cases: participants who engage in the behavior under study.
    • Controls: participants who do not engage in the behavior.
  • Allows comparison between cases and controls, but other factors (lurking variables) may affect results.
  • Lurking variable: a variable not measured that still influences other variables.

Experimental Design and Confounding

  • Two variables are confounded if their effects cannot be distinguished.
  • Advantage of experiments: better control of variables, reducing confounding.
  • Randomization is a key component of experimental design.
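
The following sketch illustrates why randomization reduces confounding. It simulates a treatment that has no effect at all, plus a lurking variable (labelled "older" here) that raises the event rate. All probabilities, the variable names, and the helper `rate_gap` are assumptions chosen for the illustration.

```python
import random

rng = random.Random(3)
N = 20_000

def outcome(is_older):
    # The treatment has NO effect here; only the lurking variable matters.
    return rng.random() < (0.30 if is_older else 0.10)

def rate_gap(assign):
    """Event rate among treated minus event rate among untreated."""
    treated, untreated = [], []
    for _ in range(N):
        is_older = rng.random() < 0.5            # lurking variable
        takes_treatment = assign(is_older)
        (treated if takes_treatment else untreated).append(outcome(is_older))
    return sum(treated) / len(treated) - sum(untreated) / len(untreated)

# Observational: older subjects choose the treatment more often (assumed).
observational_gap = rate_gap(
    lambda is_older: rng.random() < (0.8 if is_older else 0.2)
)

# Experiment: a coin flip decides, independent of the lurking variable.
randomized_gap = rate_gap(lambda is_older: rng.random() < 0.5)

print(f"observational gap: {observational_gap:+.3f}")
print(f"randomized gap:    {randomized_gap:+.3f}")
```

In the observational setting the treatment appears harmful because its effect is confounded with age. With random assignment both groups contain a similar mix of older and younger subjects, so the gap shrinks toward zero.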

Treatment vs Control Groups and Placebo

  • Treatment group: receives the treatment being tested.
  • Control group: does not receive the treatment.
  • Participants are assigned to the two groups at random, so the groups are alike in all respects except for the treatment.
  • Placebo effect: participants improve because they believe they are being treated, even though the active ingredient is absent.
  • A placebo is indistinguishable from the real treatment; participants cannot tell the difference.
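
Random assignment to a treatment group and a control group can be sketched as a shuffle followed by a split. The participant IDs and the function name `randomize` are hypothetical; a real trial would typically use a pre-registered allocation procedure.

```python
import random

def randomize(participants, seed=None):
    """Shuffle participants and split them into two equal-sized groups."""
    rng = random.Random(seed)
    shuffled = list(participants)   # copy so the input is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical participant IDs.
ids = [f"P{i:03d}" for i in range(1, 21)]
treatment, control = randomize(ids, seed=0)

print("treatment:", sorted(treatment))
print("control:  ", sorted(control))
```

Because chance alone decides membership, neither the researcher nor the participant can steer particular kinds of people into one group.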

Blinding in Experiments

  • Single-blind: participants do not know if they are in treatment or control; researchers may know.
  • Double-blind: neither participants nor researchers know who is in treatment or control.

Example: Aspirin and Heart Attacks (Single Blind, Randomized Controlled Experiment)

  • Population: all men, age 50-84.
  • Sample: 400 men in the study.
  • Treatments: aspirin vs placebo.
  • Explanatory variable: oral medication.
  • Response variable: whether the individual had a heart attack.
  • Design: single-blind, randomized controlled experiment.

Which is Better: Experiment vs Observational?

  • Scenarios:
    • Determine if listening to jazz while studying improves grades.
    • Determine if taking aspirin daily reduces heart attack incidence.
    • Determine if high trans-fat intake contributes to diabetes.
    • Determine which seed corn yields the highest per-acre yield.
  • Generally, experiments offer more control and stronger causal inference; observational studies are more prone to confounding.

Common Problems with Statistical Studies

  • Issues:
    • Studies based on poor samples: very small samples, biased sample selection, volunteer/self-selected samples, faulty survey questions.
    • Poorly defined variables: clear definitions of what is measured, and how, are missing.
    • Poorly defined questions; study objectives unclear.
    • Self-funded or self-interested studies: the sponsor has a claim it wants the results to support.
    • Faulty conclusions or misrepresentation of results:
      • Misleading graphs: selective data presentation.
      • Interpreting correlation as causation.
      • Lack of context.
      • Confounding.

Summary

  • Two major branches: descriptive and inferential.
  • Population data are usually unavailable, so sample data are used instead.
  • Four basic sampling methods: simple random, systematic, stratified, cluster.
  • Data can be classified as qualitative (categorical) or quantitative.
  • Quantitative data can be discrete or continuous.
  • Basic types of statistical studies: surveys, observational studies, controlled experiments.
  • Key concepts: sampling bias, randomization, case-control studies, lurking variables, confounding, blinding, placebo.
  • The goal of statistics is to collect, organize, summarize, analyze, and draw conclusions from data with awareness of biases and study design.