topic 1

Population and Sample

Population: all subjects or items of interest; size N
Sample: a subset drawn from a population; size n
There are many possible samples from a population; number of samples depends on N and n

Terminology

Data: observations collected (e.g., measurements, responses)
Parameter: characteristic of a population
Statistic: characteristic of a sample (sample statistic)
The observed value of a statistic estimates a parameter
A statistic is unbiased if its sampling distribution mean equals the parameter

Individuals and Variables

Individuals: objects described in data (people, animals, plants, things)
Variable: property that characterizes an individual; can take different values

Types of Variables

Quantitative: numeric values; can report a mean across individuals (e.g., age, blood pressure, leaf length)
Categorical: descriptive categories; can report counts or proportions (e.g., gender, blood type, flower color)

Classifying Variables (Brief Example)

Example dataset (from slides): Diagnosis (categorical), Age at death (quantitative)
Question: What is recorded about each individual? Is each variable quantitative or categorical?
- Their diagnosis and age at death are being recorded, with the diagnosis being categorial variables and the age at death being the quantitative variables

Visualizing Quantitative Data: Common Graphs

Histograms: single-variable overview; show pattern of variability; useful for large data sets

Dotplots (or Stem & Leaf): show raw data; useful for small data sets as they describe the pattern of variability
Time Series Plots: data points in sequence (e.g., over time); emphasizes changes over time

Histograms

Histogram: horizontal axis = class intervals (bins); vertical axis = frequencies (counts)
Heights of bars = frequencies; bars are adjacent
Relative Frequency Histogram: vertical axis = relative frequencies (percentage)

Making a Histogram

Divide range of the quantitative variable into equal-size intervals (bins)
Vertical axis = either frequency or relative frequency
For each class, draw a column with height = count or percent in that class

Dotplots

Dotplot: plot each data value on a scale; dots stack for identical values

Stem-and-Leaf Plot

Stem-and-Leaf: data are split into stem (leading digits) and leaf (trailing digits)

Measures of Center

Center = representative value indicating where the data cluster
Main measures: Mean, Median, Mode

The Mean

Definition: mean = arithmetic average
Formula: adding values and dividing by total number of values
Key:
- (mu) = Pop. Mean
- N= # of indv. in a pop.
- x = sample mean
- n = sample size

The Median

Definition: middle value when data are ordered
How to find: sort values; if n is odd, median = middle value; if n is even, median = mean of two middle values divided by 2

The Mode

Definition: most frequent value
If two values share greatest frequency: bimodal
More than two with greatest frequency: multimodal
If no value repeats: no mode

The Best Measure of Center

Mean is not resistant to skews and outliers
Median is resistant to skew and outliers
For approximately symmetric data with one mode, mean ≈ median
For obviously asymmetric data, report both mean and median

Measures of Variation

Variation measures how data values differ from each other
Key concepts: range, standard deviation, variance, quartiles

Range and Standard Deviation

Range = max − min
Standard deviation (sample): s=

Notation: \sigma = population SD; s = sample SD

Variance

Variance (sample) =
Variance = square of SD
Notation
\sigma = Population standard deviation (sigma)
x^2 = Population variance
s = Sample standard deviation
s2 = Sample variance
N = Population size
n = Sample Size

Quartiles and Five-Number Summary

Median is Q2 (second quartile), 50th percentile
Quartiles divide the data values into 4 equal parts
Q1 = 25th percentile; Q3 = 75th percentile
Different procedures can yield different quartiles; not universal
Five-number summary: min, Q1, Median (Q2), Q3, max

Boxplots and IQR

IQR = Q3 − Q1
Boxplot shows min, Q1, median, Q3, max; whiskers extend to data range within 1.5 IQR
Outliers are values outside typical pattern (suspected outliers) beyond 1.5 IQR from quartiles

IQR and Suspected Outliers

Suspected low outlier: value < Q1 − 1.5 × IQR
Suspected high outlier: value > Q3 + 1.5 × IQR

How to Draw a Boxplot

Steps: compute five-number summary; set scale to include min and max; draw box from Q1 to Q3 with median line; extend whiskers to min and max

Standardization: Z-scores

Standardized score (z-score) allows comparison across data sets
- It’s the number of sd that a given value x is above or below the mean
Population: z = \frac{x - \mu}{\sigma}
Sample: z = \frac{x - \bar{x}}{s}

Interpreting Histograms (4 characteristics)

Shape/Distribution: unimodal, bimodal, symmetric, skewed, irregular
Center: approximate midpoint or peak location
Spread: range of observed values
Outliers: any points that may be outliers

Exploratory Data Analysis (EDA)

EDA uses tools to understand center, variation, distribution, outliers, and time
Outliers can dramatically affect the mean, SD, and histogram scale

Boxplot Utility (IQR approach)

Boxplot highlights central tendency and variability; useful for comparing groups

Quick Reference: Example Use of IQR for Outliers

Compute IQR; identify values beyond Q1 − 1.5 IQR or Q3 + 1.5 IQR as possible outliers