Notes for Topic 1: Describing Data Visually and Numerically

STAT 2020: Topic 1 — Describing Data Visually and Numerically

This topic introduces how to describe data both visually and numerically, focusing on foundational ideas used throughout descriptive statistics.

WHAT IS STATISTICS?

Statistics is the science of data.
Data are numbers with a context.
Biostatistics is the application of statistics to biology, including design and analysis of experiments and observational studies.

TWO BRANCHES OF STATISTICS

Descriptive Statistics
- Methods for organizing, summarizing, and presenting data in an informative way.
Inferential Statistics
- Methods for drawing conclusions about a population based on data from a sample.

POPULATION VS SAMPLE

Population: all subjects or items of interest. Size is denoted by N.
Sample: a group selected from a population. Size is denoted by n.
Many different samples can be drawn from a given population; the number of distinct samples depends on both population and sample sizes.
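
As a rough illustration of that dependence: if samples are drawn without replacement and order is ignored (an assumption these notes do not state), the number of distinct samples of size n from a population of size N is the binomial coefficient "N choose n". A minimal Python sketch follows; the function name count_samples is hypothetical.

```python
import math

def count_samples(N, n):
    """Number of distinct unordered samples of size n, drawn without
    replacement from a population of size N (assumed sampling scheme)."""
    return math.comb(N, n)

# Even a small population yields many possible samples.
print(count_samples(10, 3))
print(count_samples(72, 5))
```

The count grows very quickly with N, which is why any one observed sample is only one of many that could have been drawn.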

TERMINOLOGY

Data: the observations collected (e.g., measurements, genders, survey responses).
Parameter: a number describing a population characteristic.
Statistic: a number describing a sample characteristic (sample statistic).
The observed value of a statistic is used to estimate the unobserved value of a parameter.
A statistic is unbiased if its sampling distribution mean equals the parameter it estimates.
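
The unbiasedness statement can be checked by brute force on a tiny example. The sketch below assumes simple random sampling without replacement and uses a made-up population of five values (not course data). It lists every possible sample of size 2, computes each sample mean, and compares the average of those sample means with the population mean.

```python
from fractions import Fraction
from itertools import combinations

def mean(values):
    """Exact arithmetic mean, returned as a Fraction."""
    values = list(values)
    return Fraction(sum(values), len(values))

# Hypothetical population of N = 5 values (made up for illustration).
population = [2, 4, 6, 8, 10]
n = 2

# One sample mean per possible sample: this is the sampling distribution.
sample_means = [mean(s) for s in combinations(population, n)]

print("parameter (population mean):", mean(population))
print("mean of all sample means:   ", mean(sample_means))
```

The two printed values agree, which is what it means for the sample mean to be an unbiased estimator of the population mean under this sampling scheme.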

INDIVIDUALS AND VARIABLES

Individuals: the objects described in a data set (people, animals, plants, things).
- Examples: freshmen, newborns, golden retrievers, fields of corn, cells.
Variable: a property that characterizes an individual; it can take different values across individuals.
- Examples: Age, gender, blood pressure, blood type, leaf length, flower color.

TWO TYPES OF VARIABLES

Quantitative (numerical) variables
- A quantity assessed or measured for each individual; we can report the average.
- Examples: Age (years), blood pressure (mm Hg), leaf length (cm).
Categorical (qualitative) variables
- A characteristic describing each individual; we can report counts or proportions.
- Examples: Gender (male/female), blood type (A, B, AB, O), flower color (white, yellow, red).
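
A short sketch of the "counts or proportions" summary for a categorical variable. The blood-type values below are made up for illustration.

```python
from collections import Counter

# Hypothetical blood types for 8 individuals (made-up values).
blood_types = ["A", "O", "B", "O", "AB", "A", "O", "A"]

counts = Counter(blood_types)   # count per category
n = len(blood_types)
proportions = {category: count / n for category, count in counts.items()}

for category in sorted(counts):
    print(category, counts[category], proportions[category])
```

An average of these values would be meaningless, which is the practical difference from a quantitative variable.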

PROBLEM: CLASSIFYING VARIABLES

Data table example (Patients A–G) with variables such as Diagnosis and Age at death.
Question: What is being recorded about those individuals? For each variable, is it numeric (quantitative) or a statement (categorical)?

ANSWER: CLASSIFYING VARIABLES (REVIEW)

For each listed variable, determine whether it is quantitative (numeric) or categorical (qualitative). For example, Diagnosis is categorical, while Age at death is quantitative.

COMMON WAYS TO CHART QUANTITATIVE DATA

Histograms
- A summary graph for a single variable; useful for understanding the pattern of variability, especially for large data sets.
Dotplots (or Stem & Leaf plots)
- Graphs for raw data; useful for describing variability, especially for small data sets.
Time Series Plots
- Graphs with a sequence on the horizontal axis (e.g., time); a line connects points to emphasize changes over time.

VISUALIZING DATA: HISTOGRAMS

A histogram is a graph where:
- Horizontal axis: classes of data values.
- Vertical axis: frequencies (or relative frequencies).
- Heights of bars correspond to frequencies, and bars are drawn adjacent to each other.
Related variant: the relative frequency histogram, which plots relative frequencies instead of counts.

MAKING A HISTOGRAM

1) Divide the range of the quantitative variable into equal-size intervals (classes/bins) to form the horizontal axis.
2) Let the vertical axis represent either the frequency (counts) or the relative frequency (percent of total).
3) For each class, draw a column whose height is the count or percent in that class.
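
The three steps above can be sketched in Python. This is a minimal illustration, not a prescribed method: the function name histogram_table is hypothetical, the data are made up, and classes are assumed to be closed on the left and open on the right, [lower, upper), which the notes do not specify.

```python
def histogram_table(values, width, start=0):
    """Return (lower, upper, count, relative_frequency) rows for
    equal-width classes [lower, upper), beginning at `start`.
    Assumes every value is >= start."""
    n = len(values)
    n_classes = int((max(values) - start) // width) + 1
    counts = [0] * n_classes
    for v in values:
        counts[int((v - start) // width)] += 1
    return [(start + i * width, start + (i + 1) * width, c, c / n)
            for i, c in enumerate(counts)]

# Hypothetical measurements (made up; not a course data set).
times = [12, 43, 55, 60, 78, 99, 100, 140, 149, 210]

# Print each class with a bar of stars, its count, and its percent of total.
for lower, upper, count, rel in histogram_table(times, width=50):
    print(f"[{lower}, {upper}) {'*' * count} {count} ({rel:.0%})")
```

Each printed row is one bar of the histogram: the star count is the frequency, and the percentage is the relative frequency.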

MAKING A HISTOGRAM: GUINEA PIG SURVIVAL TIME EXAMPLE

Example: Guinea pig survival time (days) after inoculation with a pathogen (n = 72).
Build a histogram with class size 50, starting at zero (zero included in the first class).

VISUALIZING DATA: DOTPLOTS

A dotplot shows each data value as a point on a scale; equal values are stacked.
Example: Height of poplar trees under different treatments; patterns suggest fertilizer increases height.
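
A dotplot can be approximated in plain text. The sketch below assumes integer data and, for simplicity, lays the scale out vertically with dots stacked to the right; the function name text_dotplot and the sample values are hypothetical.

```python
from collections import Counter

def text_dotplot(values):
    """One line per integer from min to max; equal values stack as dots."""
    counts = Counter(values)
    return [f"{v:>3} | {'.' * counts[v]}"
            for v in range(min(values), max(values) + 1)]

# Hypothetical integer measurements (made-up values).
for line in text_dotplot([3, 5, 5, 6, 3, 3]):
    print(line)
```

Values that do not occur still get a line, so gaps in the data remain visible.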

VISUALIZING DATA: STEM & LEAF PLOT

Structure:
- Stem on top; leaves listed below.
- Example data shown (stems 7, 6, 5, etc.) with leaves arranged to reveal distribution.
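
A minimal sketch of building a stem & leaf plot in Python, assuming non-negative two-digit integers, with the tens digit as the stem and the ones digit as the leaf. It uses the common layout with stems in a left-hand column in ascending order, which may differ from the arrangement on the slide. The function name stem_and_leaf and the data are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Stem = value // 10, leaf = value % 10, for non-negative integers."""
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // 10].append(v % 10)
    lines = []
    # Include empty stems so that gaps in the distribution stay visible.
    for stem in range(min(groups), max(groups) + 1):
        leaves = " ".join(str(leaf) for leaf in groups.get(stem, []))
        lines.append(f"{stem} | {leaves}")
    return lines

# Hypothetical scores (made-up values).
for line in stem_and_leaf([52, 57, 71, 75, 75, 78]):
    print(line)
```

Because every leaf is kept, the raw data can be read back from the plot, unlike a histogram.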

MEASURES OF CENTER

The center of a data set is a representative value indicating where the data cluster.
Main measures:
- Mean
- Median
- Mode
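
The three measures can be computed with Python's standard statistics module. The data below are made up for illustration.

```python
import statistics

# Hypothetical data set of n = 6 values (made-up values).
data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)      # sum of values divided by n: 30 / 6
median = statistics.median(data)  # middle of sorted data: average of 3 and 5
mode = statistics.mode(data)      # most frequent value

print(mean, median, mode)
```

Here the three measures differ (5, 4, and 3), a reminder that "center" depends on which measure is used.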

MEASURE OF CENTER: THE MEAN

Definition: The arithmetic average.
Formula: $$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$