PSTAT 5LS – Intro to Data (Slide Set 1)

Course Logistics

  • Welcome to PSTAT 5LS – Intro to Data (Slide Set 1, pp. 1–2)

  • Today’s topic: Intro to Data

  • Tomorrow (Wednesday): No lecture

  • Thursday: continue Intro to Data

  • Deadlines
    • Homework 1: Tuesday, July 1, 11:59 PM
    • Homework 2: Friday, July 4, 11:59 PM

  • Office-hour schedule starts next week and is posted on Canvas
    • Instructor office hours: T & R, 2–3 PM (Zoom)

Core Questions for the Quarter (p. 3)

  • How do we collect data that answer a research question?

  • How do we summarize & display data so they can “tell their story”?

  • What conclusions can legitimately be drawn from the data?

Working with Data (pp. 4–5)

  • Real-world data are messy; course examples will be pre-cleaned “tidy” data.

  • Three skill pillars introduced in this slide set:
    • Key terminology
    • Graphical displays that reveal structure
    • Numerical summaries that quantify insights

Hawks Data Set (pp. 6–8)

  • Collected at a hawk blind near Lake MacBride, Iowa, by Cornell College students & faculty.

  • Full data set >1000 hawks; we analyze a random sample of 50.

  • Variables:
    Year – year measured (quantitative)
    BandNumber – ID band code (categorical, even though it looks numeric)
    Species – CH (= Cooper’s), RT (= red-tailed), SS (= sharp-shinned) (categorical)
    Wing – length of primary wing feather (mm) (quantitative)
    Weight – body weight (g) (quantitative)
    Culmen – upper-bill length (mm) (quantitative)
    Hallux – killing-talon length (mm) (quantitative)
    Tail – tail-length-related measure (mm) (quantitative)

  • Data-table structure: one row = one hawk (individual, case, or observational unit); one column = one variable. Such rectangular structure makes it easy to add rows (new observations) or columns (new variables).
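
The rectangular structure described above can be sketched in a few lines of Python. This is only an illustration: the band numbers, measurement values, and the WeightKg column are made up and do not come from the Hawks data set.

```python
# Each dict is one row (one hawk); each key is one column (one variable).
# All values below are hypothetical and used only for illustration.
hawks = [
    {"BandNumber": "0001", "Species": "RT", "Wing": 385.0, "Weight": 920.0},
    {"BandNumber": "0002", "Species": "SS", "Wing": 170.0, "Weight": 105.0},
]

# Adding a row = recording one new observation (a new hawk).
hawks.append({"BandNumber": "0003", "Species": "CH", "Wing": 240.0, "Weight": 410.0})

# Adding a column = recording one new variable for every hawk.
# WeightKg is a hypothetical derived variable (grams converted to kilograms).
for hawk in hawks:
    hawk["WeightKg"] = hawk["Weight"] / 1000

n_rows = len(hawks)
columns = list(hawks[0].keys())
print(n_rows, "rows; columns:", columns)
```

BandNumber is stored as text here, which matches its role as a categorical ID rather than a number to do arithmetic on.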

Types of Variables (pp. 9–12)

  • Quantitative (numerical/measurement)
    • Values are numbers on which arithmetic makes sense (e.g., averages).
    • Ex: Year, Wing length, Weight, Culmen, Hallux, Tail.

  • Categorical (qualitative)
    • Values are group labels; arithmetic is not meaningful.
    • Ex: Species, BandNumber.

  • Converting quantitative → categorical
    • Possible by grouping continuous values into ranges (e.g., Age → Age Group 18–24, 25–34, etc.).
    • The reverse (categorical → quantitative) is not possible.
    • Numeric coding of categories (e.g., 1 = Yes, 2 = No) does not change variable type.
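
A minimal sketch of the quantitative → categorical conversion, assuming whole-number ages and the two example ranges named above. The function name, the "under 18" and "35+" labels, and the sample ages are assumptions added for illustration.

```python
def age_group(age):
    """Map a whole-number age (quantitative) to a labeled range (categorical)."""
    if age < 18:
        return "under 18"   # assumed label, not from the slides
    if age <= 24:
        return "18-24"
    if age <= 34:
        return "25-34"
    return "35+"            # assumed label, not from the slides

ages = [19, 22, 31, 47, 25]              # hypothetical ages
groups = [age_group(a) for a in ages]
print(groups)

# Numeric coding does not make a categorical variable quantitative:
# the codes are only labels, so averaging them would be meaningless.
yes_no_codes = {"Yes": 1, "No": 2}
```

Once ages are grouped, the exact values cannot be recovered from the labels, which is one way to see why the reverse conversion does not work.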

Exploring a Single Variable – Distributions (pp. 13–14)

  • Distribution = values + how often they occur.

  • Two complementary tools

    1. Graphical displays – show patterns, trends, outliers, and relationships.

    2. Numerical summaries – quantify center, spread, etc.

Graphing Categorical Variables (pp. 15–16)

  • Bar Chart
    • Bars for each category; height = count or percentage.

  • Pie Chart
    • Slice size = percentage of whole (must sum to 100%).

  • Hawk-species example (sample n = 50):
    • Cooper’s (CH) = 6, Red-tailed (RT) = 27, Sharp-shinned (SS) = 17
    • Either representation conveys relative frequencies.
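
The relative frequencies behind either chart can be computed directly from the sample counts listed above. The text-based bar chart is just a rough stand-in for a real plot.

```python
# Species counts from the sample of 50 hawks.
counts = {"CH": 6, "RT": 27, "SS": 17}

n = sum(counts.values())
percents = {species: 100 * count / n for species, count in counts.items()}

# Crude text bar chart: one '#' per hawk, followed by count and percentage.
for species, count in counts.items():
    print(f"{species:<3}{'#' * count} {count} ({percents[species]:.0f}%)")
```

The percentages (12%, 54%, 34%) sum to 100%, which is the requirement for a pie chart.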

Graphing Quantitative Variables (pp. 17–22)

  • Dot Plot
    • Each observation = a dot on an axis.
    • Works best for small data sets and few unique values.
    • Demo: mpg values from R’s mtcars data set (n = 32).

  • Histogram
    • Breaks the numeric axis into bins/classes; bar height = frequency in bin.
    • Better for larger data sets and many unique values.

  • Hawk-weight histogram (all species) reveals two main weight clusters, which suggests that two species have similar weight distributions while the third species differs.

  • Narrow focus: histogram of 69 Cooper’s hawks (full data set) used to practice interpretation.
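
To make the binning step concrete, here is a small sketch of how a histogram's bar heights can be computed. The function name, the bin settings, and the weights are hypothetical; the weights are not taken from the Hawks data.

```python
def histogram_counts(values, start, width, n_bins):
    """Count how many values fall in each bin [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        idx = int((v - start) // width)
        if 0 <= idx < n_bins:      # values outside the binned range are ignored
            counts[idx] += 1
    return counts

# Hypothetical weights in grams, invented for illustration.
weights = [95, 110, 150, 160, 175, 340, 420, 980, 1010, 1100, 1150, 1230]
counts = histogram_counts(weights, start=0, width=250, n_bins=5)

for i, c in enumerate(counts):
    left = i * 250
    print(f"[{left:>4}, {left + 250:>4}) {'#' * c}")
```

Changing the bin width changes the picture, so it is worth trying more than one width before describing the shape.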

Reading a Histogram (p. 23)

  • Example questions asked of Cooper’s-hawks histogram:
    “What % weigh ≤ 200 g?”
    “What % weigh between 300 and 600 g?”
    ⇒ Estimate by comparing bar areas to total area (or counts to n = 69).
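
The counts-to-n approach can be sketched as follows. The bin edges and bin counts are hypothetical (chosen only so that they total 69); the actual histogram's values are not reproduced here.

```python
# Hypothetical histogram: bins of width 100 g from 100 g to 700 g.
bin_edges = [100, 200, 300, 400, 500, 600, 700]
bin_counts = [2, 5, 20, 28, 12, 2]          # made-up counts that total 69

def percent_between(lo, hi):
    """Percent of observations in bins lying entirely within [lo, hi]."""
    total = sum(bin_counts)
    inside = sum(
        count
        for left, right, count in zip(bin_edges, bin_edges[1:], bin_counts)
        if left >= lo and right <= hi
    )
    return 100 * inside / total

print(f"At most 200 g:      {percent_between(0, 200):.1f}%")
print(f"Between 300-600 g:  {percent_between(300, 600):.1f}%")
```

This only gives exact answers when the cutoffs line up with bin edges; otherwise the result is an estimate, as it would be when reading the plot by eye.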

Describing Shape (p. 24)

  • Mode(s): number of peaks → unimodal, bimodal, multimodal.

  • Symmetry:
    Symmetric = mirror-image around mid-point.
    Right-skewed (positive) = long right tail.
    Left-skewed (negative) = long left tail.
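
One common numerical companion to these shape descriptions, offered here as a rule of thumb rather than something stated in the slides, is that a long tail tends to pull the mean toward it while the median moves less. The three data sets below are hypothetical and were invented to show each shape.

```python
from statistics import mean, median

# Hypothetical data sets, one per shape.
shapes = {
    "symmetric":    [1, 2, 3, 4, 5],
    "right-skewed": [1, 1, 2, 2, 2, 3, 3, 4, 5, 7, 10, 14],
    "left-skewed":  [1, 5, 8, 10, 11, 12, 12, 13, 13, 13, 14, 14],
}

summary = {name: (mean(values), median(values)) for name, values in shapes.items()}

for name, (m, med) in summary.items():
    print(f"{name:<13} mean = {m:.1f}  median = {med:.1f}")
```

A histogram is still the primary tool for judging shape; comparing mean and median is only a quick cross-check.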

Illustrative Shape Examples (pp. 25–26)

  • Body Temperatures (n ≈ 1000): evaluate whether historic “98.6°F” claim holds; histogram used to inspect center, spread, and potential outliers.

  • Final Course Percentages for PSTAT 5LS (prior term): histogram demonstrates grade distribution; invites description (e.g., skew, modality, clustering).

Multimodal Distributions (pp. 27–29)

  • Multiple peaks often signal hidden subgroups. Summarizing with a single mean/SD can be misleading.

  • Golden retriever weights example (n=200)
    • Overall histogram is bimodal.
    • Separating into female vs. male produces two unimodal histograms.
    • Proper analysis requires stratifying by subgroup.
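
A small sketch of stratifying by subgroup, in the spirit of the golden retriever example. The eight (sex, weight) pairs are hypothetical, and the units are assumed to be pounds; the actual n = 200 data are not reproduced.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (sex, weight in lb) pairs, invented for illustration.
dogs = [("F", 55), ("F", 58), ("F", 60), ("F", 62),
        ("M", 68), ("M", 70), ("M", 73), ("M", 75)]

# Group the weights by sex.
by_sex = defaultdict(list)
for sex, weight in dogs:
    by_sex[sex].append(weight)

overall_mean = mean(weight for _, weight in dogs)
group_means = {sex: mean(weights) for sex, weights in by_sex.items()}

print("Overall mean:", overall_mean)
print("Group means: ", group_means)
```

In this made-up data the overall mean (65.125) falls in the gap between the two groups, where no dog's weight actually lies, which is the sense in which a single summary can mislead.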

Outliers (pp. 30–31)

  • Outliers = observations outside overall pattern.

  • Never remove outliers automatically; investigate first (data-entry error? genuinely unusual value?).

  • Body-temperature histogram shows one reading ≈ 101.5–102°F.
    • Students are asked: Is that an outlier? Why/why not?

Center & Variability + Statistical Framework (pp. 32–35)

  • Population = entire group of interest.

  • Sample = subset actually measured.

  • Parameter = numerical summary of a population.

  • Statistic = numerical summary of a sample.

  • Key principle: Sample statistics estimate population parameters.
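
The relationship between the four terms can be illustrated with a toy simulation. The population here (the integers 1 to 1000) and the seed are arbitrary assumptions chosen so that the parameter is known exactly; with real data the parameter is usually unknown.

```python
import random
from statistics import mean

random.seed(1)  # arbitrary seed so the sketch is reproducible

# Toy population: every value is known, so the parameter can be computed exactly.
population = list(range(1, 1001))
mu = mean(population)                    # parameter (population mean)

# Simple random sample of 50 without replacement.
sample = random.sample(population, 50)
x_bar = mean(sample)                     # statistic (sample mean)

print(f"Parameter mu   = {mu}")
print(f"Statistic xbar = {x_bar}")
```

Rerunning with a different seed gives a different sample and a different statistic, while the parameter stays fixed.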

Notation (p. 34)

    Measure      Population (Parameter)    Sample (Statistic)
    Mean         μ                         x̄
    Std. Dev.    σ                         s
    Variance     σ²                        s²

  • When writing results, always specify whether you are reporting a statistic or a parameter, and use the correct symbol.
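
Python's standard statistics module keeps the same distinction: pvariance and pstdev treat the data as a whole population (σ², σ), while variance and stdev treat the data as a sample (s², s) and divide by n − 1. The eight data values below are hypothetical.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values

# If the data are the entire population: parameters.
mu = mean(data)
sigma_sq = pvariance(data)        # divides by N
sigma = pstdev(data)

# If the data are a sample from a larger population: statistics.
x_bar = mean(data)
s_sq = variance(data)             # divides by n - 1
s = stdev(data)

print(f"Population: mu = {mu}, sigma^2 = {sigma_sq}, sigma = {sigma}")
print(f"Sample:     xbar = {x_bar}, s^2 = {s_sq:.3f}, s = {s:.3f}")
```

The mean is computed the same way in both cases; only the symbol and the interpretation change. The variance and standard deviation differ numerically because of the n − 1 divisor.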

Practical & Ethical Implications (implicit throughout)

  • Good data practice includes:
    • Ensuring representativeness of samples to support valid inference.
    • Being transparent about cleaning, transformations, and reasons for omitting data (e.g., outliers).
    • Choosing summaries appropriate to distribution shape (modalities, skewness) to avoid misleading conclusions.