PSTAT 5LS – Intro to Data (Slide Set 1)

Welcome to PSTAT 5LS – Intro to Data (Slide Set 1)
Today’s topic: Intro to Data
Tomorrow (Wednesday): No lecture
Thursday: continue Intro to Data
Deadlines
• Homework 1: Tuesday, July 1, 11:59 PM
• Homework 2: Friday, July 4, 11:59 PM
Office-hour schedule starts next week and is posted on Canvas
• Instructor office hours: T & R, 2–3 PM (Zoom)

Real-world data are messy; course examples will be pre-cleaned “tidy” data.
Three skill pillars introduced in this slide set:
• Key terminology
• Graphical displays that reveal structure
• Numerical summaries that quantify insights

Example data set (hawks): collected at a hawk blind near Lake MacBride, Iowa, by Cornell College students & faculty.
Full data set >1000 hawks; we analyze a random sample of 50.
Variables:
• Year – year measured (quantitative)
• BandNumber – ID band code (categorical, even though it looks numeric)
• Species – CH (= Cooper’s), RT (= red-tailed), SS (= sharp-shinned) (categorical)
• Wing – length of primary wing feather (mm) (quantitative)
• Weight – body weight (g) (quantitative)
• Culmen – upper-bill length (mm) (quantitative)
• Hallux – killing-talon length (mm) (quantitative)
• Tail – tail-length-related measure (mm) (quantitative)
Data-table structure: one row = one hawk (individual, case, or observational unit); one column = one variable. Such rectangular structure makes it easy to add rows (new observations) or columns (new variables).
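
As a rough illustration of the rectangular structure described above, a table can be sketched in plain Python as a list of rows, each row a dictionary keyed by variable name. The band codes, measurements, and the derived `WeightKg` column are made-up placeholders for illustration, not values from the hawks data.

```python
# Hypothetical rows: these values are placeholders, not real hawk measurements.
hawks = [
    {"BandNumber": "A-001", "Species": "RT", "Wing": 380.0, "Weight": 1000.0},
    {"BandNumber": "A-002", "Species": "SS", "Wing": 170.0, "Weight": 100.0},
]

# Adding a row = recording a new observation (one more hawk).
hawks.append({"BandNumber": "A-003", "Species": "CH", "Wing": 240.0, "Weight": 400.0})

# Adding a column = recording a new variable for every hawk.
# WeightKg is an invented derived variable (grams converted to kilograms).
for row in hawks:
    row["WeightKg"] = row["Weight"] / 1000.0

n_rows = len(hawks)                     # number of individuals
column_names = list(hawks[0].keys())    # names of the variables
print(n_rows, column_names)
```

Each dictionary plays the role of one hawk (one case), and every dictionary carries the same keys, which is what keeps the table rectangular.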

Quantitative (numerical/measurement)
• Values are numbers on which arithmetic makes sense (e.g., averages).
• Ex: Year, Wing length, Weight, Culmen, Hallux, Tail.
Categorical (qualitative)
• Values are group labels; arithmetic is not meaningful.
• Ex: Species, BandNumber.
Converting quantitative → categorical
• Possible by grouping continuous values into ranges (e.g., Age → Age Group 18–24, 25–34, etc.).
• The reverse (categorical → quantitative) is not possible.
• Numeric coding of categories (e.g., 1 = Yes, 2 = No) does not change variable type.
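
The Age → Age Group conversion above can be sketched as a small function. The cut points 18–24 and 25–34 come from the example; the "under 18" and "35+" labels and the sample ages are assumptions added only so that every input gets a label.

```python
def age_group(age):
    """Map a quantitative age (in years) to a categorical age-group label."""
    if age < 18:
        return "under 18"   # assumed label, not from the slides
    if age < 25:
        return "18-24"
    if age < 35:
        return "25-34"
    return "35+"            # assumed label, not from the slides

ages = [19, 24, 25, 31, 40]            # illustrative values only
groups = [age_group(a) for a in ages]
print(groups)
```

Ages 19 and 24 both map to "18-24", so the label alone cannot tell them apart. That loss of information is why the conversion only works in one direction.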

Distribution = values + how often they occur.
Two complementary tools
1. Graphical displays – show patterns, trends, outliers, and relationships.
2. Numerical summaries – quantify center, spread, etc.

Bar Chart
• Bars for each category; height = count or percentage.
Pie Chart
• Slice size = percentage of whole (slices must sum to $100\%$).
Hawk-species example (sample $n=50$):
• Cooper’s (CH) = 6, Red-tailed (RT) = 27, Sharp-shinned (SS) = 17
• Either representation conveys relative frequencies.
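
A minimal sketch of how the species counts above turn into the percentages that a bar chart or pie chart would display. The text-based bars are only a stand-in for a real plot.

```python
# Species counts from the sample of 50 hawks.
counts = {"CH": 6, "RT": 27, "SS": 17}

n = sum(counts.values())  # total sample size
percentages = {species: 100 * c / n for species, c in counts.items()}

# Crude text bar chart: one '#' per hawk, followed by count and percentage.
for species, c in counts.items():
    print(f"{species} {'#' * c} {c} ({percentages[species]:.0f}%)")
```

The percentages come out as 12%, 54%, and 34%, which sum to 100% as a pie chart requires.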

Dot Plot
• Each observation = a dot on an axis.
• Works best for small data sets and few unique values.
• Demo: mpg values from R’s mtcars data set ($n=32$).
Histogram
• Breaks the numeric axis into bins/classes; bar height = frequency in bin.
• Better for larger data sets and many unique values.
Hawk-weight histogram (all species) reveals two main weight clusters → suggests that two species share a similar weight distribution while the third species differs.
Narrow focus: histogram of 69 Cooper’s hawks (full data set) used to practice interpretation.
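
The binning step behind a histogram can be sketched as follows. The helper `histogram_counts`, the bin width of 250 g, and the weights are hypothetical choices made for illustration; they are not the hawk measurements.

```python
def histogram_counts(values, start, width, n_bins):
    """Count the values in each bin [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        i = int((v - start) // width)   # index of the bin containing v
        if 0 <= i < n_bins:             # values outside the bins are ignored
            counts[i] += 1
    return counts

# Made-up weights in grams, chosen to form two clusters.
weights = [95, 110, 130, 160, 340, 410, 980, 1020, 1100, 1150]
bin_counts = histogram_counts(weights, start=0, width=250, n_bins=5)

# Text histogram: bar length = frequency in the bin.
for i, c in enumerate(bin_counts):
    print(f"[{i * 250:>4}, {(i + 1) * 250:>4}) {'#' * c}")
```

The empty middle bin separates the low and high clusters, the same kind of gap that hints at distinct groups in a real histogram.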

Example questions asked of Cooper’s-hawks histogram:
• “What % weigh $\le 200\text{ g}$?”
• “What % weigh between $300$ and $600\text{ g}$?”
⇒ Estimate by comparing bar areas to total area (or counts to $n=69$).
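
One way to carry out the estimate above is to add up the counts of the bins that fall inside the range and divide by the total. The bin edges and counts below are invented so that they sum to 69; they are not read from the actual Cooper’s-hawk histogram, and `percent_in_range` is a hypothetical helper.

```python
# Hypothetical (bin, count) pairs; each bin is (lower edge, upper edge) in grams.
bins = [
    ((100, 200), 4),
    ((200, 300), 3),
    ((300, 400), 22),
    ((400, 500), 25),
    ((500, 600), 13),
    ((600, 700), 2),
]

def percent_in_range(bins, low, high):
    """Percent of observations in bins lying entirely inside [low, high]."""
    total = sum(count for _, count in bins)
    inside = sum(count for (lo, hi), count in bins if lo >= low and hi <= high)
    return 100 * inside / total

pct_le_200 = percent_in_range(bins, 0, 200)
pct_300_600 = percent_in_range(bins, 300, 600)
print(f"<= 200 g: {pct_le_200:.1f}%   300-600 g: {pct_300_600:.1f}%")
```

This only gives a clean answer when the range endpoints line up with bin edges. If a question cuts through the middle of a bar, the result is an approximation.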

Mode(s): number of peaks → unimodal, bimodal, multimodal.
Symmetry:
• Symmetric = mirror-image around mid-point.
• Right-skewed (positive) = long right tail.
• Left-skewed (negative) = long left tail.

Body Temperatures (n ≈ 1000): evaluate whether historic “$98.6^{\circ}\text{F}$” claim holds; histogram used to inspect center, spread, and potential outliers.
Final Course Percentages for PSTAT 5LS (prior term): histogram demonstrates grade distribution; invites description (e.g., skew, modality, clustering).

Multiple peaks often signal hidden subgroups. Summarizing with a single mean/SD can be misleading.
Golden retriever weights example (n=200)
• Overall histogram is bimodal.
• Separating into female vs. male produces two unimodal histograms.
• Proper analysis requires stratifying by subgroup.
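
A small sketch of why stratifying matters: the overall mean of a bimodal sample can land in a region where no observation lies. The six weights below are made up for illustration (the slides use n=200), and the units are arbitrary.

```python
from statistics import mean

# Made-up (sex, weight) pairs; not the actual golden retriever data.
dogs = [("F", 54), ("F", 58), ("F", 62), ("M", 70), ("M", 74), ("M", 78)]

# Single summary that ignores the subgroups.
overall_mean = mean([w for _, w in dogs])

# Stratify: collect weights by sex, then summarize each group separately.
by_sex = {}
for sex, w in dogs:
    by_sex.setdefault(sex, []).append(w)
group_means = {sex: mean(ws) for sex, ws in by_sex.items()}

print(overall_mean, group_means)
```

Here the overall mean is 66, a value that falls in the gap between the two groups, while the group means of 58 and 74 describe each subgroup well.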

Outliers = observations outside overall pattern.
Never remove automatically—investigate first (data entry error? true unusual value?).
Body-temperature histogram shows one reading at about $101.5$–$102^{\circ}\text{F}$.
• Students are asked: Is that an outlier? Why/why not?

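When writing results, always specify whether you are reporting a statistic (a summary of the sample, e.g., the sample mean $\bar{x}$) or a parameter (a summary of the population, e.g., the population mean $\mu$), and use the correct symbol.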

Good data practice includes:
• Ensuring representativeness of samples to support valid inference.
• Being transparent about cleaning, transformations, and reasons for omitting data (e.g., outliers).
• Choosing summaries appropriate to distribution shape (modalities, skewness) to avoid misleading conclusions.