Introduction to Data and Statistics (Video)

Chapter 1: Introduction to Data and Statistics

Section 1: Why Learn Statistics?

  • Personal anecdote sets the stage: a warm August night in Chicago; grandfather in the hospital with dizziness and dehydration. The night nurse prepares to administer hypertension medication based on a BP reading.

  • Observation: grandfather’s daily routine included measuring BP every morning; he typically recorded BP and pulse and kept a notebook.

  • Current BP reading: 110. The nurse explains the policy: meds are given if BP > 130 and withheld if BP < 100; between 100 and 130 it is the nurse's discretion, and meds are usually given.

  • Question raised: Is 110 a typical value for this grandfather? Giving the medication when BP is already low could make it plummet overnight; withholding it when it is actually needed could let BP spike. Both outcomes are potentially dangerous.

  • The learner decides not to give the medicine, guided by statistical thinking, even without medical training. Confidence rooted in a statistical approach.

  • Core messages:

    • This class teaches how to think statistically and apply statistical thinking to life decisions.

    • Statistics is the science of variability and decision making under uncertainty.

    • By thinking statistically, you can better navigate uncertainty you will face in life.

    • The goal is to learn how to think calmly and confidently in real-world hospital-like situations.

  • Big idea: statistics as a decision-support tool under uncertainty, grounded in understanding variability and typical patterns.

Section 2: What is Statistics?

  • Common misconceptions: statistics is not only math, numbers, graphs, data, equations, or fancy analytics; at heart, statistics is something broader.

  • Core definition: Statistics is a lens—a perspective to approach the world and solve problems in the face of uncertainty. It is the science of variability.

  • Concept of variation:

    • Variation is universal: even in something as simple as hair on a head, no two hairs are identical.

    • When we zoom out, patterns and similarities emerge despite variability.

    • Statistics guides us through uncertainty by focusing on these patterns and similarities rather than single observations.

  • Practical framing: statistics helps you think about how things vary, identify patterns, and make informed decisions under uncertainty.

  • The three guiding principles to memorize (summarized in the course):

    • What was compared? This questions the basis of a claim: what comparison (baseline, control group, alternative) actually supports the claimed effect?

    • Who’s not here? This emphasizes the representativeness of a sample and questions data provenance.

    • Incorporate "ish"-ness. Always account for uncertainty and consider potential biases in data.

  • LaTeX notes:

    • Typical blood pressure reference in the story:

    • \text{typical BP} \approx 140\,\text{mmHg}

    • Common range observed:

    • 120 \le \text{BP} \le 160 \quad \text{(common, not extreme)}

    • Outlier concern: the value 110 falls below the lower bound of the typical range, i.e., \(110 < 120\).

Chapter 2: Statistical Thinking and Reasoning

The Statistical Investigative Cycle

  • Statistics as a problem-solving lens for real-world problems.

  • Five components of the statistical investigative cycle:
    1) Problem: What do you want to learn by doing a statistical analysis?
    2) Plan: Which methods and study designs are best suited to address the problem?
    3) Data: How will you measure and store information?
    4) Analysis: What analyses should you conduct, and how?
    5) Conclusion: How do the results help answer the problem?

  • Visual cycle (as described in the text):

    • PROBLEM → PLAN → DATA → ANALYSIS → CONCLUSION → (back to) PROBLEM with new questions

  • The cycle is a framework to organize information and guide practice.

Application: Grandfather BP decision using the Statistical Investigative Cycle

  • Real-world problem: Should we give the BP medication after a reading of 110?

  • Statistical problem: Is 110 far enough from typical BP to justify pausing medication, given daily variability?

  • Statistical plan: No fresh data collection needed; use past BP readings (historical data) to assess variability. The plan leverages existing information to form a model.

  • Data: Historical BP readings over the past several months; daily values used to build intuition about the distribution.

  • Analysis: Build a simple statistical model of BP variance:

    • Typical BP (center): \(\text{BP}_{\text{typical}} \approx 140\) mmHg

    • Variability: values around 120–160 are not rare; 110 is far from typical and from the 120 lower threshold.

    • Conclusion from the model: 110 is not consistent with the grandfather’s typical BP, even after accounting for day-to-day variation; thus, the decision to pause the medication is reasonable.
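The analysis step above can be sketched in a few lines of Python. The historical readings here are hypothetical stand-ins for the grandfather's notebook values, chosen to match the typical center (~140) and common range (120–160) described in the notes.

```python
# Hypothetical historical BP readings (the notes do not list the actual values).
historical_bp = [138, 142, 145, 136, 150, 141, 148, 139, 144, 147]

# Center and spread of the historical distribution.
mean_bp = sum(historical_bp) / len(historical_bp)
variance = sum((x - mean_bp) ** 2 for x in historical_bp) / (len(historical_bp) - 1)
std_bp = variance ** 0.5

# How far is tonight's reading from typical, in units of day-to-day variation?
reading = 110
z = (reading - mean_bp) / std_bp

print(f"typical BP ~ {mean_bp:.0f}, sd ~ {std_bp:.1f}, z-score of {reading}: {z:.1f}")
# A reading several standard deviations below the center is atypical even
# after accounting for daily variability, which supports pausing the medication.
```

This is only a sketch of the reasoning, not a clinical rule; the point is that "far from typical, relative to usual variation" can be made precise.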

  • Conclusion: Based on the analysis, pause and consider further evaluation rather than reflexively administering the pill.

  • Takeaway: The cycle helps structure thinking about problems, data, and decisions in uncertain situations.

  • Key phrases from the course that encapsulate practice:

    • There is no statistics without context.

    • The five components form a repeatable workflow for problem-solving.

  • How the cycle relates to the grandfather example:

    • Problem: Should we administer BP medication after reading 110?

    • Plan/Data/Analysis: Use existing BP history to judge whether 110 is typical or unusual.

    • Conclusion: 110 is atypical; pause before acting; seek additional information or monitoring.

Chapter 3: Data and Datasets

Data as representations; tidy data concept

  • Core idea: Statistics is a lens to reason about variability by capturing what happens in the form of data.

  • Data are not naturally occurring; they are constructed representations of observations.

  • Focus: tidy data as a standard way to map real-world observations into a dataset.

  • Two main components of a tidy dataset:

    • Observations correspond to rows.

    • Attributes correspond to columns.

  • Key terminology:

    • Observations: the things we are interested in (e.g., individual students, days in a future health-tracking window).

    • Attributes: the pieces of information collected about each observation (e.g., Major, Year-in-School, GPA).

Examples to illustrate tidy data

  • Example 1: Student dataset

    • Observations: individual students (Alexa, Biraj, Chang, Deji, Elaina) – 5 observations.

    • Attributes (columns): Name, Major, Year-in-School, High School GPA.

    • Note: The first row containing the header (e.g., "Name") is not an observation.

    • Data layout: 5 rows of student observations; 4 attribute columns (Name, Major, Year-in-School, High School GPA) – Name is itself an attribute.

  • Example 2: Grandfather health indicators over 90 days (tidy data example)

    • Observations: days in the 90-day period (one observation per day).

    • Attributes: Date, Blood Pressure, Pulse, Oxygen Saturation (4 attributes).

    • Rationale: tidy data supports compatibility with data collection and analysis tools.
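The student example above can be sketched as a tidy pandas DataFrame (pandas is a standard tool for tidy, tabular data in Python). The student names and attribute columns come from the notes; the majors, years, and GPA values are hypothetical fill-ins.

```python
import pandas as pd

# Tidy layout: each row is one observation (a student),
# each column is one attribute. Only the names and column
# headers come from the notes; the cell values are hypothetical.
students = pd.DataFrame({
    "Name": ["Alexa", "Biraj", "Chang", "Deji", "Elaina"],
    "Major": ["Statistics", "Biology", "History", "CS", "Economics"],      # hypothetical
    "Year-in-School": ["freshman", "sophomore", "junior", "senior", "junior"],  # hypothetical
    "High School GPA": [3.8, 3.5, 3.9, 3.2, 3.7],                          # hypothetical
})

print(students.shape)  # (5, 4): 5 observations (rows) x 4 attributes (columns)
```

Note that the header row is metadata, not an observation: `shape` reports 5 rows because only the students count.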

Why tidy data matters

  • Standard format: tidy data is a standard format compatible with many software tools.

  • Facilitates analysis: easier to apply consistent analyses and generate insights across datasets.

Practice questions (practice data interpretation)

  • Example #1: Google stock dataset

    • Columns/attributes: Date, Open, Close, Volume; Date is the observation identifier; Open/Close/Volume are attributes.

    • Observations: one row per date.

    • Answers from the text:

    • a) Number of observations: 5 (Oct 2–6, 2023).

    • b) Number of attributes: 4 (Date, Open, Close, Volume).

  • Example #2: Great Lakes environmental dataset

    • Observations: 5 lakes (Superior, Huron, Michigan, Erie, Ontario).

    • Attributes: Lake, Native Species, Total Species, Average Depth, Surface Area.

    • Answers from the text:

    • a) Number of observations: 5.

    • b) Number of attributes: 5 (Lake plus the four measurements).
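The counting logic behind these practice answers can be sketched in plain Python using the Great Lakes example. The lake names and attribute headers come from the notes; the numeric values are hypothetical placeholders.

```python
# Great Lakes dataset sketch: one row per lake (observation),
# one column per attribute. Measurements are hypothetical.
header = ["Lake", "Native Species", "Total Species", "Average Depth", "Surface Area"]
rows = [
    ["Superior", 80, 90, 147, 31700],   # hypothetical values
    ["Huron",    70, 95,  59, 23000],
    ["Michigan", 75, 100, 85, 22300],
    ["Erie",     65, 85,  19,  9910],
    ["Ontario",  60, 80,  86,  7340],
]

n_observations = len(rows)    # one row per lake
n_attributes = len(header)    # Lake plus the four measurements

print(n_observations, n_attributes)  # 5 5
```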

Chapter 4: Measures and Measurement

Section 1: Types of Measures

  • Central question: How should you capture information about observations? This depends on the type of data.

  • Types of data:

    • Quantitative data: numeric values representing quantities; may include decimals. Examples:

    • Blood pressure measurements like 142 mmHg, 128 mmHg, 100 mmHg.

    • Indicators such as total annual revenue, total number of fish species in a lake.

    • Categorical data: values drawn from a set of category labels (non-numeric labels). Examples:

    • Eye color: 'brown', 'green', 'blue'.

    • A student’s major chosen from a set of options.

    • Whether a chemical compound is organic or inorganic.

    • Rating scale data: ordered categories or numeric scales with an intrinsic order, not the same as quantitative data.

    • Examples: {strongly disagree, disagree, agree, strongly agree}; {never, rarely, sometimes, often, always}; pain scale 1–10; year-in-school: {freshman, sophomore, junior, senior}.

    • Important note: ordered categorical (ordinal) data are treated like rating scales in analysis: the order of categories matters, but the spacing between them is not assumed to be equal.

    • Text data: open-ended textual responses; can be single words or sentences; examples:

    • Student writing their major in an open-ended field.

    • Writing an equation like energy, mass, and speed of light relation in text.

    • Time series data: values that indicate a moment in time (a date, month, year, etc.). Distinct from durations; a time attribute typically answers the question "When?".

    • Examples: day of full moon, birth year, month you started your first job.

  • Distinguishing notes:

    • Time series vs. duration: time data record when something occurred; durations (how long something lasted) are quantitative values, not time data.

  • Quick exercises (concept checks): identify the type (quantitative, categorical, rating scale, text, or time series) for various attributes (prompts appear in the chapter)

    • Examples include: Smartphone operating system, smartphone model, usage frequency, most-used app, number of smartphones owned.
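One way to see the distinction between these measure types is how they might be encoded in pandas. The smartphone survey values below are hypothetical; the point is that a rating scale can be stored as an *ordered* categorical, preserving its intrinsic order without pretending the labels are true quantities.

```python
import pandas as pd

# Hypothetical smartphone survey illustrating the four measure types.
survey = pd.DataFrame({
    "os": ["iOS", "Android", "Android"],        # categorical (labels, no order)
    "phones_owned": [3, 1, 2],                  # quantitative (numeric quantity)
    "usage": ["often", "sometimes", "always"],  # rating scale (ordered categories)
    "first_phone_year": pd.to_datetime(
        ["2015-06-01", "2018-01-15", "2012-09-30"]
    ),                                          # time data: answers "When?"
})

# Encode the rating scale as an ordered categorical.
survey["usage"] = pd.Categorical(
    survey["usage"],
    categories=["never", "rarely", "sometimes", "often", "always"],
    ordered=True,
)

print(survey["usage"].min())  # sometimes (order-aware, not alphabetical)
```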

Section 2: Reliability and the Measurement Process

  • Reliability is a core property of a measurement: the extent to which the data reflect the true, real-world characteristics of the observations.

  • Measurement process considerations:

    • Different ways to measure a given attribute (e.g., temperature) yield different data quality.

    • The by-hand temperature check is likely less reliable than digital tools or calibrated devices.

    • Circumference measurement examples: different methods include tape measure, smartphone app, or wrapping arms around a tree trunk.
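The reliability contrast above can be sketched with a small simulation: two hypothetical methods measuring the same tree-trunk circumference, one far noisier than the other. All numbers here are assumptions for illustration.

```python
import random
import statistics

random.seed(0)
true_circumference = 100.0  # hypothetical true value, in cm

# Two hypothetical measurement methods with different noise levels.
tape_measure = [true_circumference + random.gauss(0, 0.5) for _ in range(50)]  # precise
arm_wrap     = [true_circumference + random.gauss(0, 8.0) for _ in range(50)]  # crude

# A more reliable method produces readings that cluster tightly around
# the true value: smaller spread across repeated measurements.
print(f"tape measure sd: {statistics.stdev(tape_measure):.2f} cm")
print(f"arm wrap sd:     {statistics.stdev(arm_wrap):.2f} cm")
```

Comparing the two standard deviations makes the abstract question "how reliable is each method?" concrete: repeated measurements with less spread are more trustworthy.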

  • Practical questions to ask when measuring:

    • Which measure best reflects the true characteristic of the observation?

    • How reliable is the data generated by each method?

    • There is no universal golden rule for reliability; the key is to assess whether the data accurately represents the real-world observation for each case.

  • Takeaway: Reliability is about trustworthiness of measurements; always consider how measurement choices affect conclusions.

  • Additional practical implications:

    • When data are biased or non-representative, conclusions can be misleading. Consider who or what is included in the sample and how it was collected.

    • Incorporating uncertainty (the "ish"-ness) helps guard against overconfident or overgeneralized claims.

    • Measurement reliability ties directly to decision-making quality in real-world contexts (e.g., medical decisions, policy decisions, business analytics).

  • Connections to prior and future learning:

    • The emphasis on variability, context, and measurement reliability sets the foundation for more advanced topics in statistical modeling, hypothesis testing, and data ethics.