Notes for Topic 1: Describing Data Visually and Numerically
STAT 2020: Topic 1 — Describing Data Visually and Numerically
- This topic introduces how to describe data both visually and numerically, focusing on foundational ideas used throughout descriptive statistics.
WHAT IS STATISTICS?
- Statistics is the science of data.
- Data are numbers with a context.
- Biostatistics is the application of statistics to biology, including design and analysis of experiments and observational studies.
TWO BRANCHES OF STATISTICS
- Descriptive Statistics
  - Methods for organizing, summarizing, and presenting data in an informative way.
- Inferential Statistics
  - Methods for drawing conclusions about a population based on data from a sample.
POPULATION VS SAMPLE
- Population: all subjects or items of interest. Size is denoted by N.
- Sample: a group selected from a population. Size is denoted by n.
- Many different samples can be drawn from a given population; the number of distinct samples depends on both population and sample sizes.
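To make the last point concrete: if samples are drawn without replacement and order does not matter (an assumption, since these notes do not specify the sampling scheme), the number of distinct samples of size n from a population of size N is the binomial coefficient "N choose n". A minimal Python sketch, where `count_samples` is an illustrative name rather than anything from the course:

```python
import math

def count_samples(N: int, n: int) -> int:
    """Number of distinct unordered samples of size n drawn without
    replacement from a population of size N (the binomial coefficient)."""
    return math.comb(N, n)

print(count_samples(5, 2))    # 10
print(count_samples(20, 5))   # 15504
```

Even a population of 20 yields thousands of possible samples of size 5, which is why different samples can give different values of the same statistic.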
TERMINOLOGY
- Data: the observations collected (e.g., measurements, genders, survey responses).
- Parameter: a number describing a population characteristic.
- Statistic: a number describing a sample characteristic (sample statistic).
- The observed value of a statistic is used to estimate the unobserved value of a parameter.
- A statistic is unbiased if its sampling distribution mean equals the parameter it estimates.
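The definition of unbiasedness can be checked by brute force on a tiny population. The sketch below uses made-up values and assumes samples are drawn without replacement. It lists every possible sample of size n, computes the sample mean of each, and compares the average of those sample means with the population mean. Exact fractions are used so the comparison is not affected by rounding.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical population of N = 5 values (made up for illustration).
population = [2, 4, 6, 8, 15]
n = 2  # sample size

def mean(values):
    """Exact arithmetic mean as a Fraction."""
    return Fraction(sum(values), len(values))

# Parameter: the population mean.
mu = mean(population)

# Statistic: the sample mean, computed for every possible sample of size n.
sample_means = [mean(s) for s in combinations(population, n)]

# Mean of the sampling distribution of the sample mean.
mean_of_sample_means = mean(sample_means)

print(mu, mean_of_sample_means)  # 7 7
```

The two values agree, which is what it means for the sample mean to be an unbiased estimator of the population mean in this setting.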
INDIVIDUALS AND VARIABLES
- Individuals: the objects described in a data set (people, animals, plants, things).
- Examples: freshmen, newborns, golden retrievers, fields of corn, cells.
- A variable: a property that characterizes an individual; can take different values across individuals.
- Examples: Age, gender, blood pressure, blood type, leaf length, flower color.
TWO TYPES OF VARIABLES
- Quantitative (numerical) variables
  - A quantity assessed or measured for each individual; we can report the average.
  - Examples: Age (years), blood pressure (mm Hg), leaf length (cm).
- Categorical (qualitative) variables
  - A characteristic describing each individual; we can report counts or proportions.
  - Examples: Gender (male/female), blood type (A, B, AB, O), flower color (white, yellow, red).
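The distinction shows up in which summaries make sense. A short Python sketch with made-up records (not course data): the quantitative variable is summarized by an average, and the categorical variable by counts and proportions.

```python
from collections import Counter
from statistics import mean

# Hypothetical records (made up for illustration, not course data).
ages_years = [19, 22, 20, 25, 19]          # quantitative
blood_types = ["A", "O", "O", "AB", "A"]   # categorical

# Quantitative: an average is meaningful.
average_age = mean(ages_years)

# Categorical: report counts and proportions instead of an average.
counts = Counter(blood_types)
proportions = {k: c / len(blood_types) for k, c in counts.items()}

print(average_age)
print(dict(counts))
print(proportions)
```

Averaging the blood types would be meaningless, while a count of each age is possible but usually less informative than the average.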
PROBLEM: CLASSIFYING VARIABLES
- Data table example (Patients A–G) with variables such as Diagnosis and Age at death.
- Question: What is being recorded about those individuals? For each variable, is it numeric (quantitative) or a statement (categorical)?
ANSWER: CLASSIFYING VARIABLES (REVIEW)
- For each listed variable, decide whether it is quantitative (numeric) or categorical (qualitative). For example, Age at death is quantitative and Diagnosis is categorical.
COMMON WAYS TO CHART QUANTITATIVE DATA
- Histograms
  - A summary graph for a single variable; useful for understanding pattern of variability, especially for large data sets.
- Dotplots (or Stem & Leaf plots)
  - Graphs for raw data; useful for describing variability, especially for small data sets.
- Time Series Plots
  - Graphs with a sequence on the horizontal axis (e.g., time); a line connects points to emphasize changes over time.
VISUALIZING DATA: HISTOGRAMS
- A histogram is a graph where:
  - Horizontal axis: classes of data values.
  - Vertical axis: frequencies (or relative frequencies).
  - Heights of bars correspond to frequencies, and bars are drawn adjacent to each other.
- Related variant: the relative frequency histogram, whose bar heights show each class's share of the total rather than its count.
MAKING A HISTOGRAM
1) Divide the range of the quantitative variable into equal-size intervals (classes/bins) to form the horizontal axis.
2) Vertical axis represents either the frequency (counts) or the relative frequency (percent of total).
3) For each class, draw a column whose height is the count or percent in that class.
MAKING A HISTOGRAM: GUINEA PIG SURVIVAL TIME EXAMPLE
- Example: Guinea pig survival time (days) after inoculation with a pathogen (n = 72).
- Build a histogram with class size 50, starting at zero (zero included in the first class).
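The three steps for making a histogram, with a class size of 50 starting at zero, can be sketched in Python as follows. The survival times below are made up for illustration and are NOT the actual guinea pig data. The function name `histogram_counts` and the choice of classes that include their lower endpoint but not their upper endpoint (0 to just under 50, 50 to just under 100, and so on) are assumptions of this sketch, as is the requirement that no value falls below the starting point.

```python
from collections import Counter

def histogram_counts(values, width=50, start=0):
    """Count values in equal-width classes [start, start + width), and so on.

    Returns a list of (lower, upper, frequency, relative_frequency) tuples,
    one per class, from the first class through the last non-empty one.
    Assumes every value is at least `start`.
    """
    # Step 1: assign each value to a class by its position along the axis.
    class_index = Counter(int((v - start) // width) for v in values)
    last = max(class_index)
    rows = []
    for k in range(last + 1):
        lower = start + k * width
        freq = class_index.get(k, 0)  # Step 2: frequency (count) in the class
        rows.append((lower, lower + width, freq, freq / len(values)))
    return rows

# Hypothetical survival times in days (NOT the actual guinea pig data).
times = [0, 43, 52, 58, 75, 99, 100, 120, 149, 210]

for lower, upper, freq, rel in histogram_counts(times):
    # Step 3: one bar per class; here each bar is drawn with '#' characters.
    print(f"{lower:>3}-{upper:<3} {'#' * freq:<5} {freq} ({rel:.0%})")
```

Empty classes are kept in the output so that gaps in the data remain visible, as they would be in a drawn histogram.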
VISUALIZING DATA: DOTPLOTS
- A dotplot shows each data value as a point on a scale; equal values are stacked.
- Example: Height of poplar trees under different treatments; patterns suggest fertilizer increases height.
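A dotplot is simple enough to draw as text. The sketch below assumes integer data and uses made-up values (not the poplar tree measurements). Each column is one point on the scale, and equal values are stacked vertically. `text_dotplot` is an illustrative name.

```python
from collections import Counter

def text_dotplot(values):
    """Return a text dotplot for integer data: one column per point on the
    scale from min to max, with equal values stacked as '*' characters."""
    counts = Counter(values)
    scale = range(min(values), max(values) + 1)
    tallest = max(counts.values())
    lines = []
    # Build rows from the top of the tallest stack down to the axis.
    for level in range(tallest, 0, -1):
        row = "".join(" * " if counts[v] >= level else "   " for v in scale)
        lines.append(row.rstrip())
    # Axis labels, one per scale point.
    lines.append("".join(f"{v:^3}" for v in scale))
    return "\n".join(lines)

# Hypothetical heights (made up for illustration).
heights = [3, 4, 4, 5, 5, 5, 7]
print(text_dotplot(heights))
```

Scale points with no observations (6 in this data) still get a column, so gaps and clusters are visible at a glance.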
VISUALIZING DATA: STEM & LEAF PLOT
- Structure:
  - Each value is split into a stem (the leading digit or digits) and a leaf (the final digit).
  - Stem on top; leaves listed below.
- Example data shown (stems 7, 6, 5, etc.) with leaves arranged to reveal the distribution.
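A stem and leaf plot can be built by splitting each value into its leading digits and its final digit. The sketch below assumes non-negative integers and uses made-up data. It prints one stem per row with its leaves beside it, which is a common textbook layout and may differ from the figure these notes describe. Stems run from largest to smallest to match the 7, 6, 5 ordering mentioned above.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return stem-and-leaf rows for non-negative integers.

    The stem is the value with its last digit removed; the leaf is the last
    digit. Stems are listed from largest to smallest, including any stems
    in between that have no leaves.
    """
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    top, bottom = max(leaves), min(leaves)
    rows = []
    for stem in range(top, bottom - 1, -1):
        row = f"{stem} | " + " ".join(str(d) for d in leaves.get(stem, []))
        rows.append(row.rstrip())
    return rows

# Hypothetical data (made up for illustration).
scores = [52, 58, 61, 64, 64, 67, 70, 73, 75]
for row in stem_and_leaf(scores):
    print(row)
# 7 | 0 3 5
# 6 | 1 4 4 7
# 5 | 2 8
```

Because every leaf is an actual digit from the data, the raw values can be read back from the plot, which a histogram does not allow.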
MEASURES OF CENTER
- The center of a data set is a representative value indicating where the data cluster.
- Main measures:
  - Mean
  - Median
  - Mode
MEASURE OF CENTER: THE MEAN
- Definition: The arithmetic average.
- Formula: $$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
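As a quick worked example of the arithmetic average, the sketch below adds up the observations and divides by how many there are. The leaf lengths are made up for illustration.

```python
def mean(values):
    """Arithmetic average: the sum of the observations divided by their count."""
    if not values:
        raise ValueError("mean of an empty data set is undefined")
    return sum(values) / len(values)

# Hypothetical leaf lengths in cm (made up for illustration).
leaf_lengths_cm = [4.0, 5.5, 6.0, 6.5]

# (4.0 + 5.5 + 6.0 + 6.5) / 4 = 22.0 / 4
print(mean(leaf_lengths_cm))  # 5.5
```

Python's standard library offers the same calculation as `statistics.mean`; the hand-written version is shown only to mirror the definition.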