Introduction to Data and Statistics (Video)
Chapter 1: Introduction to Data and Statistics
Section 1: Why Learn Statistics?
Personal anecdote sets the stage: a warm August night in Chicago; grandfather in hospital; dizziness and dehydration. Night nurse administers hypertension medication based on BP reading.
Observation: grandfather’s daily routine included measuring BP every morning; he typically recorded BP and pulse and kept a notebook.
Current BP reading: 110. The nurse explains policy: meds given if BP > 130; not given if BP < 100; when BP is between 100–130, nurse discretion, but meds usually given.
Question raised: Is 110 a typical value for this grandfather? Hypertension medication lowers BP, so if it is given when BP is already low for him, BP could plummet overnight; if it is withheld when it is actually needed, BP could skyrocket. Both outcomes are potentially dangerous.
The learner decides not to give the medicine, guided by statistical thinking, even without medical training. Confidence rooted in a statistical approach.
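The nurse's policy from the story can be written as a small decision rule. This is a minimal sketch: the function name `medication_policy` and the handling of the exact boundary values (100 and 130 treated as part of the discretion zone) are assumptions, not something the story states.

```python
def medication_policy(bp: float) -> str:
    """Return the action suggested by the policy as the nurse described it."""
    if bp > 130:
        return "give"
    if bp < 100:
        return "withhold"
    # Between 100 and 130 (boundaries included by assumption):
    # nurse discretion, with medication usually given.
    return "discretion"


print(medication_policy(110))  # discretion
```

A reading of 110 lands in the discretion zone, which is why the policy alone cannot settle the question and the grandfather's own history matters.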
Core messages:
This class teaches how to think statistically and apply statistical thinking to life decisions.
Statistics is the science of variability and decision making under uncertainty.
By thinking statistically, you can better navigate uncertainty you will face in life.
The goal is to learn how to think calmly and confidently in high-stakes real-world situations like the one in the hospital.
Big idea: statistics as a decision-support tool under uncertainty, grounded in understanding variability and typical patterns.
Section 2: What is Statistics?
Common misconceptions: statistics is not only math, numbers, graphs, data, equations, or fancy analytics; at heart, statistics is something broader.
Core definition: Statistics is a lens—a perspective to approach the world and solve problems in the face of uncertainty. It is the science of variability.
Concept of variation:
Variation is universal: even in something as simple as hair on a head, no two hairs are identical.
When we zoom out, patterns and similarities emerge despite variability.
Statistics guides us through uncertainty by focusing on these patterns and similarities rather than single observations.
Practical framing: statistics helps you think about how things vary, identify patterns, and make informed decisions under uncertainty.
The three guiding principles you’ll memorize (summarized in course):
What was compared? Ask what comparison a claim rests on; this helps assess whether the claimed effect is real and probes the basis of the claim.
Who’s not here? This emphasizes the representativeness of a sample and questions data provenance.
Incorporate "ish"-ness. Always account for uncertainty and consider potential biases in data.
LaTeX notes:
Typical blood pressure reference in the story:
\text{typical BP} \approx 140\,\text{mmHg}
Common range observed:
120 \le \text{BP} \le 160 \quad \text{(common, not extreme)}
Outlier concern: the value 110 is below the lower bound of the typical range, i.e., 110 < 120.
Chapter 2: Statistical Thinking and Reasoning
The Statistical Investigative Cycle
Statistics as a problem-solving lens for real-world problems.
Five components of the statistical investigative cycle:
1) Problem: What do you want to learn by doing a statistical analysis?
2) Plan: Which methods and study designs are best suited to address the problem?
3) Data: How will you measure and store information?
4) Analysis: What analyses should you conduct, and how?
5) Conclusion: How do the results help answer the problem?
Visual cycle (as described in the text):
PROBLEM → PLAN → DATA → ANALYSIS → CONCLUSION → (back to) PROBLEM with new questions
The cycle is a framework to organize information and guide practice.
Application: Grandfather BP decision using the Statistical Investigative Cycle
Real-world problem: Should we give the BP medication after a reading of 110?
Statistical problem: Is 110 far enough from typical BP to justify pausing medication, given daily variability?
Statistical plan: No fresh data collection needed; use past BP readings (historical data) to assess variability. The plan leverages existing information to form a model.
Data: Historical BP readings over the past several months; daily values used to build intuition about the distribution.
Analysis: Build a simple statistical model of BP (a typical value plus the usual variability around it):
Typical BP (center): \text{BP}_{\text{typical}} \approx 140\,\text{mmHg}
Variability: values between 120 and 160 are not rare; 110 is far from the typical value and below the lower bound of 120.
Conclusion from the model: 110 is not consistent with the grandfather’s typical BP, even after accounting for day-to-day variation; thus, the decision to pause the medication is reasonable.
Conclusion: Based on the analysis, pause and consider further evaluation rather than reflexively administering the pill.
Takeaway: The cycle helps structure thinking about problems, data, and decisions in uncertain situations.
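The analysis step above can be sketched in a few lines of Python. The readings in `history` are invented for illustration (the notes do not list the notebook values), and the "more than two standard deviations from the mean" rule is an assumed rule of thumb, not the method used in the course.

```python
from statistics import mean, stdev

# Hypothetical morning readings, made up for this sketch only.
history = [142, 138, 151, 129, 145, 136, 148, 140, 133, 155, 139, 144]


def is_unusual(reading, past_readings, k=2.0):
    """Flag a reading that lies more than k standard deviations from the mean.

    The k = 2 cutoff is an assumption chosen for illustration.
    """
    center = mean(past_readings)   # typical value
    spread = stdev(past_readings)  # day-to-day variability
    return abs(reading - center) > k * spread


print(is_unusual(110, history))  # True
print(is_unusual(140, history))  # False
```

With these made-up readings, 110 is flagged as unusual while 140 is not, which mirrors the reasoning in the notes: the question is not whether 110 is low in general, but whether it is low for this person.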
Key phrases from the course that encapsulate practice:
There is no statistics without context.
The five components form a repeatable workflow for problem-solving.
How the cycle relates to the grandfather example:
Problem: Should we administer BP medication after reading 110?
Plan/Data/Analysis: Use existing BP history to judge whether 110 is typical or unusual.
Conclusion: 110 is atypical; pause before acting; seek additional information or monitoring.
Chapter 3: Data and Datasets
Data as representations; tidy data concept
Core idea: Statistics is a lens to reason about variability by capturing what happens in the form of data.
Data are not naturally occurring; they are constructed representations of observations.
Focus: tidy data as a standard way to map real-world observations into a dataset.
Two main components of a tidy dataset:
Observations correspond to rows.
Attributes correspond to columns.
Key terminology:
Observations: the things we are interested in (e.g., individual students, days in a health-tracking window).
Attributes: the pieces of information collected about each observation (e.g., Major, Year-in-School, GPA).
Examples to illustrate tidy data
Example 1: Student dataset
Observations: individual students (Alexa, Biraj, Chang, Deji, Elaina) – 5 observations.
Attributes (columns): Name, Major, Year-in-School, High School GPA.
Note: The first row containing the header (e.g., "Name") is not an observation.
Data layout: 5 rows of student observations; 4 attribute columns (Name, Major, Year-in-School, High School GPA) – Name is itself an attribute.
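A minimal sketch of the student dataset as tidy data, using Python's standard `csv` module. The five names and four column headers come from the example; the majors, years, and GPAs are placeholder values invented here, since the notes do not give them.

```python
import csv
import io

# Names and headers follow the example; all other values are made up.
raw = """Name,Major,Year-in-School,High School GPA
Alexa,Statistics,Freshman,3.8
Biraj,Biology,Sophomore,3.6
Chang,Economics,Junior,3.9
Deji,History,Senior,3.5
Elaina,Physics,Freshman,3.7
"""

reader = csv.DictReader(io.StringIO(raw))
rows = list(reader)             # one dict per observation (the header is excluded)
attributes = reader.fieldnames  # column names taken from the header row

print(len(rows))        # 5
print(len(attributes))  # 4
```

The header line is consumed as column names rather than counted as a row, which matches the note that the header is not an observation.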
Example 2: Grandfather health indicators over 90 days (tidy data example)
Observations: days in the 90-day period (one observation per day).
Attributes: Date, Blood Pressure, Pulse, Oxygen Saturation (4 attributes).
Rationale: tidy data supports compatibility with data collection and analysis tools.
Why tidy data matters
Standard format: tidy data is a standard format compatible with many software tools.
Facilitates analysis: easier to apply consistent analyses and generate insights across datasets.
Practice questions (practice data interpretation)
Example #1: Google stock dataset
Columns/attributes: Date, Open, Close, Volume; Date identifies each observation and is itself an attribute, alongside Open, Close, and Volume.
Observations: one row per date.
Answers from the text:
a) Number of observations: 5 (Oct 2–6, 2023).
b) Number of attributes: 4 (Date, Open, Close, Volume).
Example #2: Great Lakes environmental dataset
Observations: 5 lakes (Superior, Huron, Michigan, Erie, Ontario).
Attributes: Lake, Native Species, Total Species, Average Depth, Surface Area.
Answers from the text:
a) Number of observations: 5.
b) Number of attributes: 5 (Lake plus the four measurements).
Chapter 4: Measures and Measurement
Section 1: Types of Measures
Central question: How should you capture information about observations? This depends on the type of data.
Types of data:
Quantitative data: numeric values representing quantities; may include decimals. Examples:
Blood pressure measurements like 142 mmHg, 128 mmHg, 100 mmHg.
Indicators such as total annual revenue, total number of fish species in a lake.
Categorical data: values drawn from a set of category labels (non-numeric labels). Examples:
Eye color: 'brown', 'green', 'blue'.
A student’s major chosen from a set of options.
Whether a chemical compound is organic or inorganic.
Rating scale data: ordered categories or numeric scales with an intrinsic order, not the same as quantitative data.
Examples: {strongly disagree, disagree, agree, strongly agree}; {never, rarely, sometimes, often, always}; pain scale 1–10; year-in-school: {freshman, sophomore, junior, senior}.
Important note: ordered categorical (ordinal) data are treated like rating scales in analysis.
Text data: open-ended textual responses; can be single words or sentences; examples:
Student writing their major in an open-ended field.
Writing out an equation in words, such as the relation between energy, mass, and the speed of light.
Time series data: values that indicate a moment in time (date, month, year, etc.). These differ from durations; a time attribute can typically answer "When?".
Examples: day of full moon, birth year, month you started your first job.
Distinguishing notes:
Time series vs duration: time series is about when something occurred; durations are quantitative values but not time-series data.
Quick exercises (concept checks): identify the type (quantitative, categorical, rating scale, text, or time series) of various attributes, using the prompts in the chapter.
Examples include: Smartphone operating system, smartphone model, usage frequency, most-used app, number of smartphones owned.
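One way to make the five types concrete is to look at how each might be represented in code. This is a sketch with invented values; the variable names and the choice of Python types are assumptions made for illustration.

```python
from datetime import date

# All values below are hypothetical.
bp_reading = 128.0                       # quantitative: a numeric quantity (mmHg)
eye_color = "brown"                      # categorical: one label from a fixed set
EYE_COLORS = {"brown", "green", "blue"}
year_in_school = "junior"                # rating scale: labels with a built-in order
YEAR_ORDER = ["freshman", "sophomore", "junior", "senior"]
major_response = "I study statistics"    # text: open-ended response
job_start = date(2020, 6, 1)             # time: answers "When?"
measured_on = date(2020, 9, 1)


def is_later(a, b, scale=YEAR_ORDER):
    """Compare two ordered labels by their position in the scale."""
    return scale.index(a) > scale.index(b)


# Subtracting two moments in time gives a duration, which is a quantity,
# not a moment in time.
elapsed = measured_on - job_start

print(eye_color in EYE_COLORS)               # True
print(is_later(year_in_school, "freshman"))  # True
print(elapsed.days)                          # 92
```

The last lines show the distinction drawn in the notes: `job_start` says when something happened, while `elapsed` is a quantitative duration.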
Section 2: Reliability and the Measurement Process
Reliability is a core property of a measurement: the extent to which data reflect the true real-world characteristics of observations.
Measurement process considerations:
Different ways to measure a given attribute (e.g., temperature) yield different data quality.
Checking temperature by hand is likely less reliable than using a digital or calibrated device.
Circumference measurement examples: different methods include tape measure, smartphone app, or wrapping arms around a tree trunk.
Practical questions to ask when measuring:
Which measure best reflects the true characteristic of the observation?
How reliable is the data generated by each method?
There is no universal golden rule for reliability; the key is to assess, case by case, whether the data accurately represent the real-world observation.
Takeaway: Reliability is about trustworthiness of measurements; always consider how measurement choices affect conclusions.
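To illustrate the two practical questions above, the sketch below compares three ways of measuring a tree's circumference against a reference value. Every number is invented, the reference value `reference_cm` is an assumption (in practice the true value is usually unknown), and mean absolute error is just one simple way to summarize how closely a method tracks the truth.

```python
from statistics import mean

reference_cm = 150.0  # assumed "true" circumference, hypothetical

# Hypothetical repeated measurements of the same tree, in centimeters.
measurements = {
    "tape measure": [149.5, 150.5, 150.0, 149.0],
    "smartphone app": [147.0, 153.0, 151.0, 146.0],
    "arm wrap estimate": [130.0, 170.0, 160.0, 125.0],
}


def mean_abs_error(values, truth):
    """Average distance between the measurements and the reference value."""
    return mean(abs(v - truth) for v in values)


errors = {method: mean_abs_error(vals, reference_cm)
          for method, vals in measurements.items()}

# Methods ordered from closest to the reference to farthest from it.
ranked = sorted(errors, key=errors.get)
print(ranked)
```

With these made-up numbers the tape measure comes out closest to the reference and the arm wrap farthest, but the ranking depends entirely on the invented data; the point is the comparison, not the result.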
Additional practical implications:
When data are biased or non-representative, conclusions can be misleading. Consider who or what is included in the sample and how it was collected.
Incorporating uncertainty (the "ish"-ness) helps guard against overconfident or overgeneralized claims.
Measurement reliability ties directly to decision-making quality in real-world contexts (e.g., medical decisions, policy decisions, business analytics).
Connections to prior and future learning:
The emphasis on variability, context, and measurement reliability sets the foundation for more advanced topics in statistical modeling, hypothesis testing, and data ethics.