STA 2023 Intro Statistics I — Comprehensive Study Notes (Overview, Definitions, Data Types, and Statistical Thinking)
Overview of Statistics and Types of Data
- STA 2023: Introduction to Statistics I, covering 1.1/1.2 – Overview of Statistics and Types of Data.
- Statistics involves practices like surveys, data collection, and describing populations; collecting data can be difficult and costly (example: census).
- The United States Census Bureau conducts the census every decade to address these challenges.
What is Statistics
- There is no single universal definition; accepted variants include:
- Statistics is the study of how to collect, organize, analyze, and interpret numerical and/or categorical information.
- A science of collecting, interpreting, and representing data.
- A language, a mechanism for creating and communicating quantitative concepts and ideas.
- A science of decision making in the face of uncertainty.
- Statistics is a way to collect information out of data.
- A mathematical science pertaining to the collection, analysis, interpretation, and presentation of data.
Definition of Statistics
- A common book definition: Statistics is the science of collecting, organizing, analyzing, and interpreting data to make decisions.
Parts of Statistics (Overview from Pages 5–14)
- Core sequence across stages:
- Collection of data
- Planning the collection of data
- Define the goal and collection method
- Collecting the data
- Organizing the data
- Cleaning and transforming the data
- Analysis of data
- Interpretation of the data
- Presentation of the results
- Decision making
- Expanded view includes:
- Analysis of data with descriptive statistics and exploratory data analysis (EDA)
- Inferential statistics
- Interpretation, presentation, and decision making
- Descriptive statistics focus includes:
- Mean, median, mode
- Variability measures
- Skew
- EDA emphasizes:
- Graphs, patterns and trends, outliers, relationships between variables
- Inferential statistics emphasizes:
- Estimation (point estimates, confidence intervals, etc.)
- Hypothesis testing
- Generalizing from sample to population
- Presentation considerations:
- Choosing appropriate format (tables, graphs, text summaries), data visualization, highlighting key results, tailoring presentation to audience, decision making
- Further nuances:
- Emphasizing important trends/patterns, annotations/callouts in graphs, summarizing key statistics
- Simplifying results for non-technical audiences vs. including technical details for experts, and adapting level of detail based on stakeholder needs
- All of the above inform decision making
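The descriptive measures listed above (mean, median, mode, variability, skew) can be sketched with Python's standard `statistics` module. The scores below are made-up values for illustration only, and the mean-versus-median comparison is only a rough hint at skew, not a formal skewness measure.

```python
import statistics

# Hypothetical exam scores for a small sample (made-up values for illustration).
scores = [72, 85, 85, 90, 64, 78, 85, 91, 70, 80]

mean = statistics.mean(scores)            # arithmetic average
median = statistics.median(scores)        # middle value of the sorted data
mode = statistics.mode(scores)            # most frequent value
sample_var = statistics.variance(scores)  # sample variance (divides by n - 1)
sample_sd = statistics.stdev(scores)      # sample standard deviation
data_range = max(scores) - min(scores)    # simplest measure of variability

# Rough hint only: a mean below the median suggests left skew, above suggests right skew.
if mean < median:
    skew_hint = "left"
elif mean > median:
    skew_hint = "right"
else:
    skew_hint = "roughly symmetric"

print(f"mean={mean}, median={median}, mode={mode}")
print(f"variance={sample_var}, sd={sample_sd:.2f}, range={data_range}, skew hint={skew_hint}")
```

Plotting the same values (for example as a histogram) would be the EDA counterpart to these numeric summaries.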
Definitions and Terminology
- Data: information coming from observations, counts, measurements, or responses. Note: "data" is plural; the singular form is "datum" (or "data point").
- Population: the collection of all outcomes, responses, measurements, or counts of interest.
- Two kinds: Finite and Infinite
- Sample: a subset or part of a population; usually intended to represent the population in a study.
- Parameter: a numerical description of a population characteristic.
- Statistic: a numerical description of a sample characteristic.
- Parameter is to Population as Statistic is to Sample.
- A population parameter or sample statistic can be a mean, proportion, etc.
- An experiment is a planned activity whose results yield a set of data.
- Sampling error: the naturally occurring discrepancy between a sample statistic and the corresponding population parameter.
- Individuals: the objects described by data (people, animals, places, things). In clinical trials they’re called subjects.
- In set theory, individuals are called elements.
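To make the parameter, statistic, and sampling error definitions concrete, here is a minimal simulation sketch. The population is hypothetical (2,000 randomly generated scores), and the sizes, seed, and variable names are assumptions chosen for illustration.

```python
import random
import statistics

random.seed(2023)  # fixed seed so the run is repeatable

# Hypothetical finite population: N = 2000 made-up exam scores.
population = [random.gauss(75, 10) for _ in range(2000)]
mu = statistics.mean(population)      # parameter: population mean

# Simple random sample of n = 70 individuals, drawn without replacement.
sample = random.sample(population, 70)
x_bar = statistics.mean(sample)       # statistic: sample mean

# Sampling error: discrepancy between the statistic and the parameter.
sampling_error = x_bar - mu

print(f"mu = {mu:.2f}, x_bar = {x_bar:.2f}, sampling error = {sampling_error:.2f}")
```

Changing the seed draws a different sample, so the statistic and the sampling error change while the parameter stays fixed.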
Parameter vs Statistic mappings (Population vs Sample)
- Population: size $N$; a parameter describes a population characteristic (e.g., mean $\mu$, standard deviation $\sigma$, variance $\sigma^2$, proportion $p$).
- Sample: size $n$; a statistic describes a sample characteristic (e.g., $\bar{x}$, $s$, $s^2$, $\hat{p}$).
- Common relationships:
- Population mean: $\mu$
- Population standard deviation: $\sigma$
- Population variance: $\sigma^2$
- Population proportion: $p$
- Sample mean: $\bar{x}$
- Sample standard deviation: $s$
- Sample variance: $s^2$
- Sample proportion: $\hat{p}$
- From pages 19–21: Notation includes:
- Population $N$; Sample $n$; Mean, Standard Deviation, Variance; Proportion
- The table of Population vs Sample vs Parameter vs Statistic shows typical roles and sizes
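For reference, the standard textbook formulas behind this notation are sketched below. The symbols $x_i$ (an individual data value), $X$ (the number of population members with the characteristic of interest), and $x$ (the corresponding count in the sample) are introduced here for illustration and are not used elsewhere in these notes.

$$
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
$$

$$
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2
$$

$$
p = \frac{X}{N}, \qquad \hat{p} = \frac{x}{n}
$$

The standard deviations are the square roots of the variances: $\sigma = \sqrt{\sigma^2}$ and $s = \sqrt{s^2}$. Note that the sample variance divides by $n - 1$ rather than $n$.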
Population vs Sample: Two Types of Data Sets
- Population: The collection of all individuals, outcomes, responses, measurements, or counts of interest.
- Sample: A subset, or part, of a population; representativeness is essential; sampling method must be appropriate to ensure representativeness.
- Example context: in a study, the sample should reflect the population accurately.
- Population size is $N$; sample size is $n$.
Examples: Identifying Population, Sample, Data Set
- Example 1 (Page 22):
- Scenario: 751 employees in the United States were asked how stressed they feel at work.
- Population: all employees in the United States.
- Sample: the 751 respondents.
- Data set: the collection of responses from those 751 employees.
- Example 2 (Page 23): Determine whether each data set is a population or a sample:
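- The salary of each employee in the mathematics department: population (every employee in the department is included).
- The amount of energy collected from every solar panel in a photovoltaic power plant: population (every panel is included).
- A survey of 250 members from an organization (union) of over 2,000 members: sample (only some members were surveyed).
- The carbon monoxide levels of 12 of 49 people who escaped a burning building: sample (only 12 of the 49 were measured).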
- Example 3 (Page 24): Determine whether each number describes a population parameter or a sample statistic:
a) A U.S. survey of about 9,400 individuals aged 15 and older found that average leisure hours per day = $5.19$.
- Likely a sample statistic (sample mean).
b) Freshman class average SAT Math score = $514$.
- Likely a population parameter (if referring to all freshmen at that university); could be a statistic if referring to a sample of freshmen.
c) The FDA found that 34% of stores in a check of several hundred retail stores were not storing fish properly.
- A sample proportion (statistic).
d) In January 2021, 54% of the governors of the 50 states were Republicans.
- Population proportion (parameter), since it covers all 50 governors.
- Page 25: The Census and American Community Survey (ACS):
- For 2010 Census, short forms collected basic demographics from all households.
- Long form previously went to ~17% of the population; replaced by ACS, surveying more than 3.5 million households per year.
- Data from a sample (ACS) are used to infer characteristics about the entire population.
- Page 26: Fast-growing states data visualization example (regions and percentages):
- Illustrates how data are summarized and presented (e.g., regional growth, numerical increases).
Descriptive vs Inferential Statistics
- Descriptive statistics: organization, summarization, and display of data; collection, presentation, and description of sample data; numeric values or graphs characterizing variable behavior.
- Inferential statistics: using a sample to draw conclusions about a population; probability is a basic tool; involves making decisions and drawing conclusions about a population.
- Practical use: descriptive summarizes data; inferential generalizes to population from sample.
Examples: Descriptive vs Inferential (Pages 28–30)
- Example 1: A study reports that 58% chose natural treatments; 71% of those are under 35.
- Descriptive: statements about the sample.
- Inferential: possible inference that younger individuals are more likely to choose natural treatments.
- Example 2: 75 randomly selected students; 35 study medicine.
- Descriptive: About 47% of the 75 students study medicine (35/75 ≈ 0.467).
- Inferential: About 47% of all university students study medicine.
- Example 3: 212 randomly selected students; 163 use college shuttles.
- Descriptive: Approximately 77% of these 212 use shuttles.
- Inferential: 77% of all college students use the shuttle.
- Example 4–5: Practice with identifying population, sample, and whether conclusions are descriptive vs inferential (pages 31–33).
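The descriptive and inferential steps in the shuttle example can be sketched in a few lines. The descriptive part uses only the figures given (163 of 212). The inferential part is an assumption added for illustration: it uses the large-sample normal approximation with a 95% confidence level (z = 1.96), which these notes mention only conceptually.

```python
import math

# Figures from the shuttle example: 163 of 212 randomly selected students use the shuttle.
n = 212
x = 163

# Descriptive step: the sample proportion summarizes these 212 students only.
p_hat = x / n

# Inferential step (assumption: normal approximation, 95% confidence, z = 1.96).
z = 1.96
standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
margin = z * standard_error
lower, upper = p_hat - margin, p_hat + margin

print(f"Sample proportion: {p_hat:.3f}")
print(f"Approximate 95% interval for the population proportion: ({lower:.3f}, {upper:.3f})")
```

The interval makes the difference visible: the descriptive statement is a single number about the sample, while the inferential statement carries a margin of uncertainty about the population.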
Terminology: Variable, Data, Experiment, and Related Concepts (Pages 34–36)
- Variable: A characteristic about each individual element of a population or sample.
- Data value (datum): The value of the variable for one element; it may be a number, a word, or a symbol.
- Data (plural): The set of values collected for the variable from each element in the sample.
- Experiment: A planned activity whose results yield data.
- Sampling error: The natural discrepancy between a sample statistic and the corresponding population parameter.
- Individuals: The objects described by data (person, animal, place, thing); in medical trials, individuals are called subjects.
- In set theory, individuals are called elements.
Terminology: Descriptive and Inference (Glossary Coherence)
- A valid measure is relevant and appropriate as a representation of the property being studied.
- A reliable measure has small random error, so repeated measurements give similar results.
- Census: full data collection on every member of the population.
- Sample survey: data collected from only some individuals.
Variable Types
- Qualitative (Categorical) Variable:
- Describes an element with nonnumeric attributes or labels; nominal measurement describes categorization (no natural order).
- Examples: phone numbers (digits used as labels, so arithmetic on them is meaningless), college year (e.g., freshman), addresses, colors.
- Quantitative (Numerical) Variable:
- Quantifies an element with numerical measurements or counts; meaningful arithmetic operations.
- Examples: GPA, age, tuition, etc.
Discrete vs Continuous (Quantitative Subtypes)
- Discrete: gaps between values; usually whole numbers; finite or countably infinite.
- Examples: number of daughters, number of cars per hour.
- Continuous: all values in an interval; no natural gaps; can be fractional.
- Examples: height, time, rainfall amount.
Variable Type Scenarios (Examples)
- Example 1 (Discrete vs Continuous; all of these variables are quantitative):
- Number of A's earned by a class of 238 students: Discrete, Quantitative.
- Amount of rainfall for a city over a week: Continuous, Quantitative.
- Number of cars visiting a drive-through in an hour: Discrete, Quantitative.
- Amount of gasoline pumped by the next 15 customers: Continuous, Quantitative.
- Length of time to complete an exam: Continuous, Quantitative.
- Height of a player: Continuous, Quantitative.
- Count of blue baseball caps: Discrete, Quantitative.
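The classification rules above can be approximated in code. The function below is a rough, hypothetical heuristic (its name and labels are not from the course): it looks only at how values are stored, so it cannot replace judgment about what a variable means.

```python
from numbers import Real

def classify_variable(values):
    """Rough heuristic: label a list of observed values by variable type."""
    # Any non-numeric value (or a bool) means the variable is treated as qualitative.
    if not all(isinstance(v, Real) and not isinstance(v, bool) for v in values):
        return "qualitative"
    # Values stored as whole numbers are treated as discrete counts.
    if all(isinstance(v, int) for v in values):
        return "quantitative, discrete"
    # Anything else numeric is treated as a continuous measurement.
    return "quantitative, continuous"

# Made-up observations for illustration.
cars_per_hour = [12, 9, 15, 11]
rainfall_inches = [0.0, 1.25, 0.4, 2.1]
cap_colors = ["blue", "red", "blue"]

print(classify_variable(cars_per_hour))
print(classify_variable(rainfall_inches))
print(classify_variable(cap_colors))
```

The heuristic fails on numeric labels: phone numbers stored as integers would be labeled discrete even though they are qualitative, which is why the meaning of the variable has to be checked by hand.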
Check Your Understanding (Practice Questions with Answers)
- Question: The average score of the 2,000 statistics students at USF was estimated from a sample of 70 students, whose average was 76.32. The value 76.32 is a:
- Answer: Statistic (computed from a sample).
- Question: The 70 students are:
- Answer: Sample (subset of the population).
- Question: The 2000 students mentioned are:
- Answer: Population (the entire group of interest).
- Population size: $N$; Sample size: $n$.
- Population mean: $\mu$; Population standard deviation: $\sigma$; Population variance: $\sigma^2$; Population proportion: $p$.
- Sample mean: $\bar{x}$; Sample standard deviation: $s$; Sample variance: $s^2$; Sample proportion: $\hat{p}$.
- Basic point estimates:
- Point estimate of the population mean: $\bar{x}$.
- Point estimate of the population proportion: $\hat{p}$.
- Confidence intervals and hypothesis testing are core components of estimation and inference (conceptual overview as introduced here).
Notes on Data Sources and Inference (Real-World Relevance)
- Census data provide complete population information but are costly; surveys and ACS provide samples to infer population characteristics.
- Understanding when to use descriptive vs inferential statistics is essential for valid conclusions and effective communication.
Quick Reference: Common Concepts and Symbols
- Population: entire group of interest; size $N$; parameterized characteristics ($\mu, \sigma, p$).
- Sample: subset of population; size $n$; statistics ($\bar{x}, s, \hat{p}$).
- Parameter → population characteristic; Statistic → sample characteristic.
- Data types: Qualitative (categorical) vs Quantitative (numerical); Further split: Discrete vs Continuous for quantitative data.
- Descriptive statistics: summarize and describe sample data.
- Inferential statistics: use sample to infer population characteristics; relies on probability theory.