PSTAT 5LS – Intro to Data (Slide Set 1)
Course Logistics
Welcome to PSTAT 5LS – Intro to Data (Slide Set 1, pp. 1–2)
Today’s topic: Intro to Data
Tomorrow (Wednesday): No lecture
Thursday: continue Intro to Data
Deadlines
• Homework 1: Tuesday
• Homework 2: Friday
Office-hour schedule starts next week and is posted on Canvas
• Instructor office hours: T & R, 2–3 PM (Zoom)
Core Questions for the Quarter (p. 3)
How do we collect data that answer a research question?
How do we summarize & display data so they can “tell their story”?
What conclusions can legitimately be drawn from the data?
Working with Data (pp. 4–5)
Real-world data are messy; course examples will be pre-cleaned “tidy” data.
Three skill pillars introduced in this slide set:
• Key terminology
• Graphical displays that reveal structure
• Numerical summaries that quantify insights
Hawks Data Set (pp. 6–8)
Collected at a hawk blind near Lake MacBride, Iowa, by Cornell College students & faculty.
Full data set >1000 hawks; we analyze a random sample of 50.
Variables:
• Year – year measured (quantitative)
• BandNumber – ID band code (categorical, even though it looks numeric)
• Species – CH (= Cooper’s), RT (= red-tailed), SS (= sharp-shinned) (categorical)
• Wing – length of primary wing feather (mm) (quantitative)
• Weight – body weight (g) (quantitative)
• Culmen – upper-bill length (mm) (quantitative)
• Hallux – killing-talon length (mm) (quantitative)
• Tail – tail-length-related measure (mm) (quantitative)
Data-table structure: one row = one hawk (individual, case, or observational unit); one column = one variable. Such rectangular structure makes it easy to add rows (new observations) or columns (new variables).
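The row/column structure described above can be sketched in plain Python as a list of dictionaries, one dictionary per hawk. This is only an illustration: the band codes and measurements are made-up placeholders (not values from the Hawks data set), and the helper names `add_row` and `add_column` are hypothetical.

```python
# Hypothetical rows for illustration only; NOT values from the Hawks data set.
hawks = [
    {"BandNumber": "X-0001", "Species": "RT", "Wing": 380.0, "Weight": 1000.0},
    {"BandNumber": "X-0002", "Species": "SS", "Wing": 170.0, "Weight": 100.0},
]


def add_row(table, row):
    """Add a new observation; it must record the same variables as existing rows."""
    if table and set(row) != set(table[0]):
        raise ValueError("new row must have the same variables as existing rows")
    table.append(row)


def add_column(table, name, values):
    """Add a new variable, supplying exactly one value per existing row."""
    if len(values) != len(table):
        raise ValueError("need exactly one value per row")
    for row, value in zip(table, values):
        row[name] = value


# A new observation is a new row; a new variable is a new column.
add_row(hawks, {"BandNumber": "X-0003", "Species": "CH", "Wing": 240.0, "Weight": 400.0})
add_column(hawks, "Tail", [220.0, 130.0, 200.0])

print(len(hawks), "rows with variables", sorted(hawks[0]))
```

Because every row carries the same set of variables, the table stays rectangular after either operation.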
Types of Variables (pp. 9–12)
Quantitative (numerical/measurement)
• Values are numbers on which arithmetic makes sense (e.g., averages).
• Ex: Year, Wing length, Weight, Culmen, Hallux, Tail.
Categorical (qualitative)
• Values are group labels; arithmetic is not meaningful.
• Ex: Species, BandNumber.
Converting quantitative → categorical
• Possible by grouping continuous values into ranges (e.g., Age → Age Group 18–24, 25–34, etc.).
• The reverse (categorical → quantitative) is not possible.
• Numeric coding of categories (e.g., 1 = Yes, 2 = No) does not change variable type.
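A minimal sketch of the quantitative → categorical conversion, using the Age → Age Group example. The 18–24 and 25–34 ranges come from the notes; the "under 18" and "35+" labels and the function name `age_group` are assumptions added so that every age maps to some group.

```python
def age_group(age):
    """Map a quantitative age (in years) to a categorical age-group label."""
    if age < 18:
        return "under 18"  # assumed label, not from the notes
    if age <= 24:
        return "18-24"
    if age <= 34:
        return "25-34"
    return "35+"  # assumed catch-all label, not from the notes


ages = [19, 24, 25, 31, 40]
groups = [age_group(a) for a in ages]
print(groups)
```

The group label no longer records the exact age, which is why the conversion cannot be undone.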
Exploring a Single Variable – Distributions (pp. 13–14)
Distribution = values + how often they occur.
Two complementary tools
Graphical displays – show patterns, trends, outliers, and relationships.
Numerical summaries – quantify center, spread, etc.
Graphing Categorical Variables (pp. 15–16)
Bar Chart
• Bars for each category; height = count or percentage.
Pie Chart
• Slice size = percentage of whole (must sum to 100%).
Hawk-species example (sample n = 50):
• Cooper’s (CH) = 6, Red-tailed (RT) = 27, Sharp-shinned (SS) = 17
• Either representation conveys relative frequencies.
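The relative frequencies behind either chart can be computed directly from the species counts listed above. The sketch below uses only those three counts; the text "bar chart" is a rough stand-in for a real plot.

```python
# Species counts from the sample of hawks in the notes.
counts = {"CH": 6, "RT": 27, "SS": 17}

n = sum(counts.values())  # total sample size
percentages = {species: 100 * c / n for species, c in counts.items()}

# Crude text bar chart: one '#' per hawk, with the percentage alongside.
for species, c in counts.items():
    print(f"{species} {'#' * c} {percentages[species]:.0f}%")
```

The same percentages serve as bar heights in a bar chart or slice sizes in a pie chart.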
Graphing Quantitative Variables (pp. 17–22)
Dot Plot
• Each observation = a dot on an axis.
• Works best for small data sets and few unique values.
• Demo: mpg values from R’s mtcars data set.
Histogram
• Breaks the numeric axis into bins/classes; bar height = frequency in bin.
• Better for larger data sets and many unique values.
Hawk-weight histogram (all species) reveals two main weight clusters, which suggests that two species share a similar weight distribution while the third species differs.
Narrow focus: histogram of 69 Cooper’s hawks (full data set) used to practice interpretation.
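To make the binning step of a histogram concrete, here is a small sketch that counts how many values fall into equal-width bins. The weights are made-up placeholders (not hawk measurements), and the function name `histogram_counts` and its parameters are hypothetical.

```python
def histogram_counts(values, start, width, n_bins):
    """Count values per bin, where bin i covers [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        i = int((v - start) // width)
        if 0 <= i < n_bins:  # values outside the binned range are ignored
            counts[i] += 1
    return counts


# Hypothetical weights in grams, for illustration only.
weights = [310, 335, 360, 365, 380, 410, 420, 445, 470, 520]
counts = histogram_counts(weights, start=300, width=50, n_bins=5)

for i, c in enumerate(counts):
    low = 300 + 50 * i
    print(f"[{low}, {low + 50}) {'#' * c}")
```

Changing `width` changes how many bars appear, so the apparent shape of a histogram depends partly on the bin choice.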
Reading a Histogram (p. 23)
Example questions asked of Cooper’s-hawks histogram:
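• What percentage of hawks weigh above (or below) a given weight?
• What percentage weigh between two given weights?
⇒ Estimate by comparing the relevant bar areas to the total area (or the relevant bin counts to the total count).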
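The estimate described above amounts to a ratio of counts when all bins have the same width, since bar area is then proportional to bin count. A minimal sketch, with made-up bin counts (not the Cooper's-hawk histogram) and a hypothetical helper name:

```python
def percent_in_bins(counts, first, last):
    """Percentage of observations in bins first..last (inclusive), from bin counts."""
    total = sum(counts)
    return 100 * sum(counts[first:last + 1]) / total


# Hypothetical counts for five equal-width bins (20 observations in total).
bin_counts = [2, 5, 8, 4, 1]

middle = percent_in_bins(bin_counts, 1, 2)  # observations in the 2nd and 3rd bins
upper = percent_in_bins(bin_counts, 3, 4)   # observations in the two highest bins
print(middle, upper)
```

If a question's cutoff falls inside a bin rather than on a bin edge, the answer can only be approximated from the histogram.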
Describing Shape (p. 24)
Mode(s): number of peaks → unimodal, bimodal, multimodal.
Symmetry:
• Symmetric = mirror-image around mid-point.
• Right-skewed (positive) = long right tail.
• Left-skewed (negative) = long left tail.
Illustrative Shape Examples (pp. 25–26)
Body Temperatures (n ≈ 1000): evaluate whether the historic claim about normal body temperature holds; histogram used to inspect center, spread, and potential outliers.
Final Course Percentages for PSTAT 5LS (prior term): histogram demonstrates grade distribution; invites description (e.g., skew, modality, clustering).
Multimodal Distributions (pp. 27–29)
Multiple peaks often signal hidden subgroups. Summarizing with a single mean/SD can be misleading.
Golden retriever weights example (n=200)
• Overall histogram is bimodal.
• Separating into female vs. male produces two unimodal histograms.
• Proper analysis requires stratifying by subgroup.
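Stratifying by subgroup can be sketched as grouping the records before summarizing them. The weights below are made-up placeholders (not the golden retriever data), with units left unspecified; they are chosen only to show that an overall mean can sit between two groups and describe neither.

```python
from statistics import mean

# Hypothetical (sex, weight) records, for illustration only.
dogs = [("F", 55), ("F", 60), ("F", 65), ("M", 70), ("M", 75), ("M", 80)]

# Split the weights by subgroup.
by_sex = {}
for sex, weight in dogs:
    by_sex.setdefault(sex, []).append(weight)

overall_mean = mean(weight for _, weight in dogs)
group_means = {sex: mean(weights) for sex, weights in by_sex.items()}

print("overall:", overall_mean)
print("by group:", group_means)
```

Here the overall mean falls between the two group means and matches no individual group, which is the sense in which a single summary can mislead for a bimodal distribution.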
Outliers (pp. 30–31)
Outliers = observations outside overall pattern.
Never remove automatically—investigate first (data entry error? true unusual value?).
Body-temperature histogram highlights one particular reading.
• Students are asked: Is that an outlier? Why/why not?
Center & Variability + Statistical Framework (pp. 32–35)
Population = entire group of interest.
Sample = subset actually measured.
Parameter = numerical summary of a population.
Statistic = numerical summary of a sample.
Key principle: Sample statistics estimate population parameters.
Notation (p. 34)
| Measure | Population (Parameter) | Sample (Statistic) |
|---|---|---|
| Mean | $\mu$ | $\bar{x}$ |
| Std. Dev. | $\sigma$ | $s$ |
| Variance | $\sigma^2$ | $s^2$ |
When writing results, always specify whether you are reporting a statistic or a parameter, and use the correct symbol.
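The statistic/parameter distinction shows up in how the summaries are computed. The sketch below uses Python's standard `statistics` module on a small made-up data set: `variance` and `stdev` divide by n − 1 (treating the data as a sample), while `pvariance` and `pstdev` divide by N (treating the data as the whole population). The variable names mirror the usual symbols and are otherwise arbitrary.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

# Hypothetical values, for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Treating the data as a SAMPLE: statistics (divide by n - 1).
x_bar = mean(data)
s2 = variance(data)
s = stdev(data)

# Treating the data as the entire POPULATION: parameters (divide by N).
mu = mean(data)
sigma2 = pvariance(data)
sigma = pstdev(data)

print(f"sample:     mean={x_bar}, variance={s2:.3f}, sd={s:.3f}")
print(f"population: mean={mu}, variance={sigma2}, sd={sigma}")
```

The mean is computed the same way in both cases; only the variance and standard deviation change with the divisor.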
Practical & Ethical Implications (implicit throughout)
Good data practice includes:
• Ensuring representativeness of samples to support valid inference.
• Being transparent about cleaning, transformations, and reasons for omitting data (e.g., outliers).
• Choosing summaries appropriate to distribution shape (modalities, skewness) to avoid misleading conclusions.