Topic 1
Population and Sample
Population: all subjects or items of interest; size N
Sample: a subset drawn from a population; size n
There are many possible samples from a population; number of samples depends on N and n
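As a rough illustration of how the number of possible samples depends on N and n, the sketch below counts samples under the assumption that sampling is done without replacement and that order does not matter (the notes do not say which scheme is meant). The sizes N = 10 and n = 3 are made up for the example.

```python
import math

# Hypothetical sizes, chosen only for illustration.
N = 10  # population size
n = 3   # sample size

# Number of distinct samples of size n, assuming sampling without
# replacement and ignoring order ("N choose n").
num_samples = math.comb(N, n)
print(num_samples)  # 120
```

Even for this tiny population there are 120 different samples, and each could give a different value of a statistic.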
Terminology
Data: observations collected (e.g., measurements, responses)
Parameter: characteristic of a population
Statistic: characteristic of a sample (sample statistic)
The observed value of a statistic estimates a parameter
A statistic is unbiased if its sampling distribution mean equals the parameter
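One way to see what "unbiased" means is to list every possible sample from a very small population and average the statistic over all of them. The sketch below does this for the sample mean; the population values and the sample size are invented for illustration, and sampling is assumed to be without replacement.

```python
from itertools import combinations
from statistics import mean

# Made-up population (N = 4) and sample size, for illustration only.
population = [2, 4, 6, 8]
n = 2

mu = mean(population)  # the parameter: population mean

# Sample mean for every possible sample of size n (without replacement).
sample_means = [mean(sample) for sample in combinations(population, n)]

# Mean of the sampling distribution of the sample mean.
mean_of_sample_means = mean(sample_means)

print(mu, mean_of_sample_means)  # both equal 5
```

Individual sample means range from 3 to 7, but their average equals the population mean, which is what it means for the sample mean to be unbiased.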
Individuals and Variables
Individuals: objects described in data (people, animals, plants, things)
Variable: property that characterizes an individual; can take different values
Types of Variables
Quantitative: numeric values; can report a mean across individuals (e.g., age, blood pressure, leaf length)
Categorical: descriptive categories; can report counts or proportions (e.g., gender, blood type, flower color)
Classifying Variables (Brief Example)
Example dataset (from slides): Diagnosis (categorical), Age at death (quantitative)
Question: What is recorded about each individual? Is each variable quantitative or categorical?
Answer: each individual's diagnosis and age at death are recorded; diagnosis is a categorical variable and age at death is a quantitative variable
Visualizing Quantitative Data: Common Graphs
Histograms: single-variable overview; show pattern of variability; useful for large data sets

Dotplots (or Stem & Leaf): show raw data; useful for small data sets as they describe the pattern of variability
Time Series Plots: data points in sequence (e.g., over time); emphasizes changes over time
Histograms
Histogram: horizontal axis = class intervals (bins); vertical axis = frequencies (counts)
Heights of bars = frequencies; bars are adjacent
Relative Frequency Histogram: vertical axis = relative frequencies (percentage)
Making a Histogram
Divide range of the quantitative variable into equal-size intervals (bins)
Vertical axis = either frequency or relative frequency
For each class, draw a column with height = count or percent in that class
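The steps above can be sketched in a few lines of Python. The data, the starting point of the first bin, and the bin width are all made up for illustration; bins are assumed to include their left edge and exclude their right edge.

```python
# Made-up data; bin settings below are assumptions for this example.
data = [12, 15, 21, 22, 22, 27, 31, 34, 38, 39]
bin_start = 10   # left edge of the first bin
bin_width = 10   # equal-size intervals
num_bins = 3     # bins: [10, 20), [20, 30), [30, 40)

# Count how many values fall in each bin.
counts = [0] * num_bins
for x in data:
    counts[(x - bin_start) // bin_width] += 1

# Relative frequency = count divided by the total number of values.
rel_freq = [c / len(data) for c in counts]

# Text version of the histogram: one row per bin, bar length = count.
for i, c in enumerate(counts):
    left = bin_start + i * bin_width
    print(f"[{left}, {left + bin_width}): {'#' * c} ({c})")
```

Plotting `rel_freq` instead of `counts` gives the relative frequency histogram; the shape is the same and only the vertical scale changes.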
Dotplots
Dotplot: plot each data value on a scale; dots stack for identical values
Stem-and-Leaf Plot
Stem-and-Leaf: data are split into stem (leading digits) and leaf (trailing digits)
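A minimal sketch of a stem-and-leaf plot, assuming two-digit non-negative integers so that the stem is the tens digit and the leaf is the ones digit. The data are made up, and stems with no values are simply omitted in this sketch.

```python
from collections import defaultdict

# Made-up two-digit data, for illustration only.
data = [12, 15, 21, 22, 22, 27, 31, 34, 38, 39]

# Split each value into stem (tens digit) and leaf (ones digit).
stems = defaultdict(list)
for x in sorted(data):
    stem, leaf = divmod(x, 10)
    stems[stem].append(leaf)

# One line per stem, leaves in increasing order.
lines = [
    f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
    for stem, leaves in sorted(stems.items())
]
for line in lines:
    print(line)
```

For this data the rows are `1 | 2 5`, `2 | 1 2 2 7`, and `3 | 1 4 8 9`, so every raw value can still be read off the plot.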
Measures of Center
Center = representative value indicating where the data cluster
Main measures: Mean, Median, Mode
The Mean
Definition: mean = arithmetic average
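Formula: add all values and divide by the total number of values
Population mean: \mu = \frac{\sum x}{N}
Sample mean: \bar{x} = \frac{\sum x}{n}
Key:
\mu = population mean
N = number of individuals in the population
\bar{x} = sample mean
n = sample size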
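As a quick worked example of the sample mean, the sketch below adds the values and divides by n. The sample values are made up.

```python
# Made-up sample, for illustration only.
sample = [4, 8, 15, 16, 23, 42]

n = len(sample)           # sample size
x_bar = sum(sample) / n   # sample mean: sum of values divided by n

print(x_bar)  # 18.0
```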
The Median
Definition: middle value when data are ordered
How to find: sort values; if n is odd, median = middle value; if n is even, median = mean of the two middle values (their sum divided by 2)
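The rule for finding the median can be sketched as a small function. The function name and the example values are illustrative, and the input is assumed to be non-empty.

```python
def median(values):
    """Middle value of the sorted data (assumes a non-empty list)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the single middle value.
        return ordered[mid]
    # Even n: mean of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_median = median([7, 1, 5])       # sorted: 1, 5, 7
even_median = median([7, 1, 5, 3])   # sorted: 1, 3, 5, 7
print(odd_median, even_median)  # 5 4.0
```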
The Mode
Definition: most frequent value
If two values share greatest frequency: bimodal
More than two with greatest frequency: multimodal
If no value repeats: no mode
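The cases above can be sketched with a frequency count. The function name `modes` is hypothetical, the input is assumed to be non-empty, and returning an empty list for "no mode" is a choice made for this example.

```python
from collections import Counter

def modes(values):
    """Return all values that share the greatest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # no value repeats: no mode
    return sorted(v for v, c in counts.items() if c == top)

one_mode = modes([1, 2, 2, 3])       # [2]
two_modes = modes([1, 1, 2, 2, 3])   # [1, 2], bimodal
no_mode = modes([1, 2, 3])           # []
print(one_mode, two_modes, no_mode)
```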
The Best Measure of Center
Mean is not resistant to skew and outliers
Median is resistant to skew and outliers
For approximately symmetric data with one mode, mean ≈ median
For obviously asymmetric data, report both mean and median
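A small made-up example shows why the median is called resistant: adding one extreme value moves the mean a lot but barely moves the median.

```python
from statistics import mean, median

# Made-up, roughly symmetric data.
data = [10, 12, 13, 14, 16]
# Same data with one extreme value added.
with_outlier = data + [100]

mean_before, median_before = mean(data), median(data)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print(mean_before, median_before)  # 13 13
print(mean_after, median_after)    # 27.5 13.5
```

The single outlier roughly doubles the mean, while the median shifts by only 0.5.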

Measures of Variation
Variation measures how data values differ from each other
Key concepts: range, standard deviation, variance, quartiles
Range and Standard Deviation
Range = max − min
Standard deviation (sample): s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}
Notation: \sigma = population SD; s = sample SD
Variance
Variance (sample): s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}
Variance = square of SD
Notation
\sigma = Population standard deviation (sigma)
\sigma^2 = Population variance
s = Sample standard deviation
s^2 = Sample variance
N = Population size
n = Sample Size
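A worked sketch of range, sample variance, and sample standard deviation, assuming the usual sample formulas with an n − 1 divisor. The data are made up, and the last lines cross-check the hand calculation against Python's standard `statistics` module.

```python
import math
import statistics

# Made-up sample, for illustration only.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(sample)
x_bar = sum(sample) / n                      # sample mean = 5.0
data_range = max(sample) - min(sample)       # range = max - min

# Sum of squared deviations from the sample mean.
squared_devs = sum((x - x_bar) ** 2 for x in sample)

s2 = squared_devs / (n - 1)   # sample variance (divisor n - 1)
s = math.sqrt(s2)             # sample SD = square root of the variance

# Cross-check: statistics.variance and statistics.stdev also use n - 1.
lib_s2 = statistics.variance(sample)
lib_s = statistics.stdev(sample)
print(data_range, s2, s)
```

For population values, the divisor would be N instead of n − 1; `statistics.pvariance` and `statistics.pstdev` implement that version.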
Quartiles and Five-Number Summary
Median is Q2 (second quartile), 50th percentile
Quartiles divide the data values into 4 equal parts
Q1 = 25th percentile; Q3 = 75th percentile
Different procedures can yield slightly different quartiles; no single method is universal
Five-number summary: min, Q1, Median (Q2), Q3, max
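Because quartile procedures differ, the sketch below has to pick one. It assumes the "median of each half" convention: Q1 is the median of the lower half and Q3 the median of the upper half, with the overall median left out of both halves when n is odd. Other conventions may give slightly different Q1 and Q3. The function name and data are made up, and at least two values are assumed.

```python
from statistics import median

def five_number_summary(values):
    """(min, Q1, median, Q3, max) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # When n is odd, skip the middle value so it is in neither half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

summary = five_number_summary([1, 3, 4, 6, 7, 8, 10, 12, 15])
print(summary)  # (1, 3.5, 7, 11.0, 15)
```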
Boxplots and IQR
IQR = Q3 − Q1
Boxplot shows the five-number summary (min, Q1, median, Q3, max); in a modified boxplot, whiskers extend only to the most extreme data values within 1.5 × IQR of the quartiles
Suspected outliers: values outside the typical pattern, more than 1.5 × IQR below Q1 or above Q3
IQR and Suspected Outliers
Suspected low outlier: value < Q1 − 1.5 × IQR
Suspected high outlier: value > Q3 + 1.5 × IQR
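The two rules above can be sketched as a small function. The name `suspected_outliers` and the data are made up; Q1 and Q3 are passed in explicitly because different quartile procedures can give different values (here they were worked out by hand as the medians of the lower and upper halves).

```python
def suspected_outliers(values, q1, q3):
    """Return (low_fence, high_fence, flagged values) using the 1.5 x IQR rule."""
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    flagged = [x for x in values if x < low_fence or x > high_fence]
    return low_fence, high_fence, flagged

# Made-up data; Q1 = 3.5 and Q3 = 11 computed by hand for this data set.
data = [1, 3, 4, 6, 7, 8, 10, 12, 40]
low_fence, high_fence, flagged = suspected_outliers(data, q1=3.5, q3=11)
print(low_fence, high_fence, flagged)  # -7.75 22.25 [40]
```

Here IQR = 7.5, so only the value 40 lies beyond a fence and is flagged as a suspected high outlier.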
How to Draw a Boxplot
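Steps: compute five-number summary; set scale to include min and max; draw box from Q1 to Q3 with median line; extend whiskers to min and max (or, if suspected outliers are plotted as separate points, to the most extreme values within 1.5 × IQR of the quartiles)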
Standardization: Z-scores
Standardized score (z-score) allows comparison across data sets
It is the number of standard deviations that a given value x lies above or below the mean
Population: z = \frac{x - \mu}{\sigma}
Sample: z = \frac{x - \bar{x}}{s}
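To illustrate how z-scores allow comparison across data sets, the sketch below standardizes two scores from two hypothetical exams. All numbers are invented; the same function covers both formulas above, since only the mean and standard deviation passed in differ.

```python
def z_score(x, center, spread):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - center) / spread

# Hypothetical exam A: mean 75, SD 5. Hypothetical exam B: mean 80, SD 10.
z_a = z_score(85, center=75, spread=5)
z_b = z_score(90, center=80, spread=10)
print(z_a, z_b)  # 2.0 1.0
```

The raw score on exam B is higher, but the score on exam A is further above its own mean in standard-deviation units, so it is the more unusual result.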
Interpreting Histograms (4 characteristics)
Shape/Distribution: unimodal, bimodal, symmetric, skewed, irregular
Center: approximate midpoint or peak location
Spread: range of observed values
Outliers: any points that may be outliers
Exploratory Data Analysis (EDA)
EDA uses tools to understand center, variation, distribution, outliers, and time
Outliers can dramatically affect the mean, SD, and histogram scale
Boxplot Utility (IQR approach)
Boxplot highlights central tendency and variability; useful for comparing groups
Quick Reference: Example Use of IQR for Outliers
Compute IQR = Q3 − Q1; flag values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR as possible outliers