STA 2023 Intro Statistics I — Comprehensive Study Notes (Overview, Definitions, Data Types, and Statistical Thinking)

Overview of Statistics and Types of Data

  • STA 2023: Introduction to Statistics I, covering 1.1/1.2 – Overview of Statistics and Types of Data.
  • Statistics involves practices such as conducting surveys, collecting data, and describing populations. Collecting data can be difficult and costly; a census is a prime example.
  • The United States Census Bureau conducts the census once every ten years.

What is Statistics

  • There is no single universal definition; accepted variants include:
    • Statistics is the study of how to collect, organize, analyze, and interpret numerical and/or categorical information.
    • A science of collecting, interpreting, and representing data.
    • A language, a mechanism for creating and communicating quantitative concepts and ideas.
    • A science of decision making in the face of uncertainty.
    • Statistics is a way to extract information from data.
    • A mathematical science pertaining to the collection, analysis, interpretation, and presentation of data.

Definition of Statistics

  • A common book definition: Statistics is the science of collecting, organizing, analyzing, and interpreting data to make decisions.

Parts of Statistics (Overview from Pages 5–14)

  • Core sequence across stages:
    • Collection of data
    • Planning the collection of data
    • Define the goal and collection method
    • Collecting the data
    • Organizing the data
    • Cleaning and transforming the data
    • Analysis of data
    • Interpretation of the data
    • Presentation of the results
    • Decision making
  • Expanded view includes:
    • Analysis of data with descriptive statistics and exploratory data analysis (EDA)
    • Inferential statistics
    • Interpretation, presentation, and decision making
  • Descriptive statistics focus includes:
    • Mean, median, mode
    • Variability measures
    • Skew
  • EDA emphasizes:
    • Graphs, patterns and trends, outliers, relationships between variables
  • Inferential statistics emphasizes:
    • Estimation (point estimates, confidence intervals, etc.)
    • Hypothesis testing
    • Generalizing from sample to population
  • Presentation considerations:
    • Choosing appropriate format (tables, graphs, text summaries), data visualization, highlighting key results, tailoring presentation to audience, decision making
  • Further nuances:
    • Emphasizing important trends/patterns, annotations/callouts in graphs, summarizing key statistics
    • Simplifying results for non-technical audiences vs. including technical details for experts, and adapting level of detail based on stakeholder needs
  • All of the above inform decision making
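
The descriptive measures listed above (mean, median, mode, variability, skew) can be computed in a few lines. The following is a minimal sketch using Python's standard `statistics` module; the quiz scores are invented for illustration and do not come from the course material.

```python
import statistics

# Hypothetical quiz scores for a small sample (invented for illustration).
scores = [72, 85, 85, 90, 64, 78, 85, 91]

mean = statistics.mean(scores)          # arithmetic average: 81.25
median = statistics.median(scores)      # middle of the sorted data: 85.0
mode = statistics.mode(scores)          # most frequent value: 85
data_range = max(scores) - min(scores)  # simplest variability measure: 27
sample_sd = statistics.stdev(scores)    # sample standard deviation (n - 1 divisor)

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={data_range}, sample sd={sample_sd:.2f}")

# A mean below the median hints at a longer left tail (left skew);
# this is a rough rule of thumb, not a formal test.
print("mean < median:", mean < median)
```

Here the mean (81.25) sits below the median (85) because the single low score of 64 pulls the average down.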

Definitions and Terminology

  • Data: information coming from observations, counts, measurements, or responses. Note: Data is plural. Singular form is datum or data point.
  • Population: the collection of all outcomes, responses, measurements, or counts of interest.
    • Two kinds: Finite and Infinite
  • Sample: a subset or part of a population; usually intended to represent the population in a study.
  • Parameter: a numerical description of a population characteristic.
  • Statistic: a numerical description of a sample characteristic.
  • Parameter is to Population as Statistic is to Sample.
  • A population parameter or sample statistic can be a mean, proportion, etc.
  • An experiment is a planned activity whose results yield a set of data.
  • Sampling error: the naturally occurring discrepancy between a sample statistic and the corresponding population parameter.
  • Individuals: the objects described by data (people, animals, places, things). In clinical trials they’re called subjects.
  • In set theory, individuals are called elements.
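
Sampling error can be made concrete with a small simulation. The sketch below builds a hypothetical finite population (randomly generated ages, not real data), draws a simple random sample, and compares the sample statistic with the population parameter. The population size, sample size, and seed are assumptions chosen for illustration.

```python
import random
import statistics

random.seed(2023)  # fixed seed so the run is repeatable

# Hypothetical finite population: ages of N = 2000 individuals.
population = [random.randint(18, 65) for _ in range(2000)]
mu = statistics.mean(population)        # parameter: population mean

# Simple random sample of n = 70 individuals, drawn without replacement.
sample = random.sample(population, 70)
x_bar = statistics.mean(sample)         # statistic: sample mean

# Sampling error: the discrepancy between the statistic and the parameter.
sampling_error = x_bar - mu

print(f"population mean (parameter): {mu:.2f}")
print(f"sample mean (statistic):     {x_bar:.2f}")
print(f"sampling error:              {sampling_error:.2f}")
```

Changing the seed produces a different sample and therefore a different sampling error, while the parameter stays fixed for a given population.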

Parameter vs Statistic mappings (Population vs Sample)

  • Population: size $N$; Parameter describes population characteristics (e.g., mean $\mu$, standard deviation $\sigma$, variance $\sigma^2$, proportion $p$).
  • Sample: size $n$; Statistic describes sample characteristics (e.g., mean $\bar{x}$, standard deviation $s$, variance $s^2$, proportion $\hat{p}$).
  • Common relationships:
    • Population mean: $\mu$
    • Population standard deviation: $\sigma$
    • Population variance: $\sigma^2$
    • Population proportion: $p$
    • Sample mean: $\bar{x}$
    • Sample standard deviation: $s$
    • Sample variance: $s^2$
    • Sample proportion: $\hat{p}$
  • From pages 19–21: Notation includes:
    • Population $N$; Sample $n$; Mean, Standard Deviation, Variance; Proportion
    • The table of Population vs Sample vs Parameter vs Statistic shows typical roles and sizes
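
The parameter and statistic notation maps directly onto code. The sketch below uses Python's standard `statistics` module, where `pvariance`/`pstdev` use the population divisor $N$ and `variance`/`stdev` use the sample divisor $n - 1$. The household-size data and the "4 or more people" characteristic are invented for illustration.

```python
import statistics

# Hypothetical population: household sizes for all N = 8 homes on a street.
population = [1, 2, 2, 3, 4, 4, 4, 6]
# Hypothetical sample of n = 4 of those homes.
sample = [2, 4, 4, 6]

# Parameters (describe the population).
N = len(population)
mu = statistics.mean(population)             # population mean
sigma2 = statistics.pvariance(population)    # population variance (divides by N)
sigma = statistics.pstdev(population)        # population standard deviation
p = sum(1 for x in population if x >= 4) / N # proportion with 4 or more people

# Statistics (describe the sample).
n = len(sample)
x_bar = statistics.mean(sample)              # sample mean
s2 = statistics.variance(sample)             # sample variance (divides by n - 1)
s = statistics.stdev(sample)                 # sample standard deviation
p_hat = sum(1 for x in sample if x >= 4) / n # sample proportion

print(f"N={N}, mu={mu}, sigma^2={sigma2}, sigma={sigma:.3f}, p={p}")
print(f"n={n}, x_bar={x_bar}, s^2={s2:.3f}, s={s:.3f}, p_hat={p_hat}")
```

The sample values differ from the population values (for example, $\hat{p} = 0.75$ against $p = 0.5$), which is the sampling error defined earlier.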

Population vs Sample: Two Types of Data Sets

  • Population: The collection of all individuals, outcomes, responses, measurements, or counts of interest.
  • Sample: A subset, or part, of a population; representativeness is essential; sampling method must be appropriate to ensure representativeness.
  • Example context: in a study, the sample should reflect the population accurately; the population size is $N$ and the sample size is $n$.

Examples: Identifying Population, Sample, Data Set

  • Example 1 (Page 22):
    • Scenario: 751 employees in the United States were asked how stressed they feel at work.
    • Population: all employees in the United States.
    • Sample: the 751 respondents.
    • Data set: the collection of responses from those 751 employees.
  • Example 2 (Page 23): Determine whether each data set is a population or a sample:
    • The salary of each employee in the mathematics department. (Population: every employee in the department is included.)
    • The amount of energy collected from every solar panel in a photovoltaic power plant. (Population: every panel is measured.)
    • A survey of 250 members from an organization (union) of over 2,000 members. (Sample: only some members were surveyed.)
    • The carbon monoxide levels of 12 of 49 people who escaped a burning building. (Sample: only 12 of the 49 were measured.)
  • Example 3 (Page 24): Determine whether each number describes a population parameter or a sample statistic:
    • a) A US survey of about 9,400 individuals aged 15+ found average leisure hours per day = $5.19$.
      • Likely a sample statistic (sample mean).
    • b) The freshman class average SAT Math score = $514$.
      • Likely a population parameter (if referring to all freshmen at that university); could be a statistic if referring to a sample of freshmen.
    • c) In a survey of several hundred retail stores, the FDA found that 34% were not storing fish properly.
      • A sample proportion (statistic).
    • d) In January 2021, 54% of the governors of the 50 states were Republicans.
      • Population proportion (parameter), since it covers all 50 governors.
  • Page 25: The Census and American Community Survey (ACS):
    • For 2010 Census, short forms collected basic demographics from all households.
    • Long form previously went to ~17% of the population; replaced by ACS, surveying more than 3.5 million households per year.
    • Data from a sample (ACS) are used to infer characteristics about the entire population.
  • Page 26: Fast-growing states data visualization example (regions and percentages):
    • Illustrates how data are summarized and presented (e.g., regional growth, numerical increases).
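
The population-or-sample decision in the examples above comes down to whether every member of the group of interest was observed. The helper below is a hypothetical sketch of that rule, not part of the course material; it assumes the size of the group of interest is known.

```python
def classify_data_set(n_observed: int, group_size: int) -> str:
    """Return 'population' if every member was observed, else 'sample'.

    Hypothetical helper for illustration; assumes group_size is known.
    """
    if n_observed > group_size:
        raise ValueError("cannot observe more members than the group contains")
    return "population" if n_observed == group_size else "sample"


# Cases taken from the examples above.
# The union has "over 2,000" members; 2000 is used here as a lower bound.
print(classify_data_set(250, 2000))  # union survey
print(classify_data_set(12, 49))     # carbon monoxide levels
print(classify_data_set(50, 50))     # governors of the 50 states
```

The same number can change roles: if the group of interest in the union example were only the 250 surveyed members, those 250 would be the population.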

Descriptive vs Inferential Statistics

  • Descriptive statistics: organization, summarization, and display of data; collection, presentation, and description of sample data; numeric values or graphs characterizing variable behavior.
  • Inferential statistics: using a sample to draw conclusions about a population; probability is a basic tool; involves making decisions and drawing conclusions about a population.
  • Practical use: descriptive summarizes data; inferential generalizes to population from sample.

Examples: Descriptive vs Inferential (Pages 28–30)

  • Example 1: A study reports that 58% chose natural treatments; 71% of those are under 35.
    • Descriptive: statements about the sample.
    • Inferential: possible inference that younger individuals are more likely to choose natural treatments.
  • Example 2: 75 randomly selected students; 35 study medicine.
    • Descriptive: About 47% ($35/75 \approx 0.467$) of the 75 students study medicine.
    • Inferential: About 47% of all university students study medicine.
  • Example 3: 212 randomly selected students; 163 use college shuttles.
    • Descriptive: Approximately 77% of these 212 use shuttles.
    • Inferential: 77% of all college students use the shuttle.
  • Example 4–5: Practice with identifying population, sample, and whether conclusions are descriptive vs inferential (pages 31–33).
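
The sample proportions in Examples 2 and 3 can be computed directly, and an interval estimate shows what an inferential statement adds to a descriptive one. The sketch below uses a normal-approximation (Wald) 95% confidence interval. That method and the value $z = 1.96$ are assumptions made here for illustration; the notes do not specify an interval method at this point.

```python
import math


def sample_proportion(successes: int, n: int) -> float:
    """Descriptive: the proportion observed in the sample."""
    return successes / n


def wald_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Inferential: approximate 95% confidence interval for the population
    proportion, using the normal approximation (an assumed method)."""
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


for label, successes, n in [("medicine", 35, 75), ("shuttle", 163, 212)]:
    p_hat = sample_proportion(successes, n)
    low, high = wald_interval(successes, n)
    print(f"{label}: p_hat = {p_hat:.3f}, approx. 95% CI = ({low:.3f}, {high:.3f})")
```

The descriptive statement is the single number $\hat{p}$ about the students actually surveyed. The interval is a hedged claim about all students, and it is only meaningful because the students were randomly selected.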

Terminology: Variable, Data, Experiment, and Related Concepts (Pages 34–36)

  • Variable: A characteristic about each individual element of a population or sample.
  • Datum (singular): The value of the variable for one element; it may be a number, word, or symbol.
  • Data (plural): The set of values collected for the variable from each element in the sample.
  • Experiment: A planned activity whose results yield data.
  • Sampling error: The natural discrepancy between a sample statistic and the corresponding population parameter.
  • Individuals: The objects described by data (person, animal, place, thing); in medical trials, individuals are called subjects.
  • In set theory, individuals are called elements.

Terminology: Descriptive and Inference (Glossary Coherence)

  • A Valid measure is relevant/appropriate to represent the property being studied.
  • A Reliable measure has small random error.
  • Census: full data collection on every member of the population.
  • Sample survey: data collected from only some individuals.

Variable Types

  • Qualitative (Categorical) Variable:
    • Describes an element with nonnumeric attributes or labels; the nominal level of measurement categorizes without any natural order.
    • Examples: phone numbers (numeric labels, so arithmetic on them is meaningless), college year (e.g., freshman), addresses, colors.
  • Quantitative (Numerical) Variable:
    • Quantifies an element with numerical measurements or counts; arithmetic operations on the values are meaningful.
    • Examples: GPA, age, tuition, etc.

Discrete vs Continuous (Quantitative Subtypes)

  • Discrete: gaps between values; usually whole numbers; finite or countably infinite.
    • Examples: number of daughters, number of cars per hour.
  • Continuous: all values in an interval; no natural gaps; can be fractional.
    • Examples: height, time, rainfall amount.

Variable Type Scenarios (Examples)

  • Example 1 (Qualitative vs Quantitative):
    • Number of A's earned by a class of 238 students: Discrete, Quantitative.
    • Amount of rainfall for a city over a week: Continuous, Quantitative.
    • Number of cars visiting a drive-through in an hour: Discrete, Quantitative.
    • Amount of gasoline pumped by the next 15 customers: Continuous, Quantitative.
    • Length of time to complete an exam: Continuous, Quantitative.
    • Height of a player: Continuous, Quantitative.
    • Count of blue baseball caps: Discrete, Quantitative.
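
The scenarios above follow a simple pattern: counts are discrete and measurements are continuous. The sketch below encodes that pattern as a rule of thumb applied to recorded values. It is a heuristic written for illustration, not a definition; the function and enum names are hypothetical.

```python
from enum import Enum


class VariableType(Enum):
    QUALITATIVE = "qualitative"
    DISCRETE = "quantitative, discrete"
    CONTINUOUS = "quantitative, continuous"


def classify_variable(values) -> VariableType:
    """Rule-of-thumb classification of recorded values (heuristic only)."""
    # Text labels are treated as categories.
    if any(isinstance(v, str) for v in values):
        return VariableType.QUALITATIVE
    # Whole-number values look like counts.
    if all(float(v).is_integer() for v in values):
        return VariableType.DISCRETE
    # Fractional values look like measurements.
    return VariableType.CONTINUOUS


print(classify_variable(["red", "blue", "blue"]).value)  # cap colors
print(classify_variable([3, 0, 2, 5]).value)             # cars per hour
print(classify_variable([0.4, 1.25, 0.0]).value)         # daily rainfall
```

The heuristic fails when the stored values hide the variable's meaning: phone numbers stored as integers would be labeled discrete, and heights rounded to whole inches would too. The correct classification depends on what the variable measures, not on how it was recorded.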

Check Your Understanding (Practice Questions with Answers)

  • Question: To estimate the average score of the 2000 statistics students at USF, a sample of 70 students was taken; their average score was 76.32. The 76.32 is:
    • Answer: Statistic (computed from a sample).
  • Question: The 70 students are:
    • Answer: Sample (subset of the population).
  • Question: The 2000 students mentioned are:
    • Answer: Population (the entire group of interest).

Notation and Summary Formulas (Key Concepts)

  • Population size: $N$; Sample size: $n$.
  • Population mean: $\mu$; Population standard deviation: $\sigma$; Population variance: $\sigma^2$; Population proportion: $p$.
  • Sample mean: $\bar{x}$; Sample standard deviation: $s$; Sample variance: $s^2$; Sample proportion: $\hat{p}$.
  • Basic point estimates:
    • Point estimate of the population mean: $\bar{x}$.
    • Point estimate of the population proportion: $\hat{p}$.
  • Confidence intervals and hypothesis testing are core components of estimation and inference (conceptual overview as introduced here).
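
For reference, the symbols above have the following standard textbook definitions. These formulas are not spelled out in this part of the notes. The notation $x_i$ (an individual data value) and $x$ (the number of sample members with the characteristic of interest) is assumed here.

$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2, \qquad \sigma = \sqrt{\sigma^2}$$

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad s = \sqrt{s^2}, \qquad \hat{p} = \frac{x}{n}$$

The sample variance divides by $n - 1$ rather than $n$, which is the main computational difference between the population and sample formulas.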

Notes on Data Sources and Inference (Real-World Relevance)

  • Census data provide complete population information but are costly; surveys and ACS provide samples to infer population characteristics.
  • Understanding when to use descriptive vs inferential statistics is essential for valid conclusions and effective communication.

Quick Reference: Common Concepts and Symbols

  • Population: entire group of interest; size $N$; parameterized characteristics ($\mu, \sigma, p$).
  • Sample: subset of population; size $n$; statistics ($\bar{x}, s, \hat{p}$).
  • Parameter → population characteristic; Statistic → sample characteristic.
  • Data types: Qualitative (categorical) vs Quantitative (numerical); Further split: Discrete vs Continuous for quantitative data.
  • Descriptive statistics: summarize and describe sample data.
  • Inferential statistics: use sample to infer population characteristics; relies on probability theory.