categorical variables
places an individual into one of several groups or categories.
quantitative variables
takes numerical values for which it makes sense to find an average.
distribution
tells us what values the variable takes and how often it takes these values; pattern of variation.
data table
lists individuals in rows, with the variables recorded for each individual in columns.
frequency table
summarizes distribution in counts.
relative frequency table
summarizes distribution in percents.
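A minimal sketch of the two tables above, built from raw categorical data. The `colors` data set is hypothetical and invented for illustration.

```python
from collections import Counter

# Hypothetical data: favorite color reported by 10 individuals.
colors = ["red", "blue", "blue", "green", "red",
          "blue", "red", "blue", "green", "blue"]

# Frequency table: count of individuals in each category.
frequency = Counter(colors)

# Relative frequency table: each count as a percent of all individuals.
total = sum(frequency.values())
relative_frequency = {
    category: 100 * count / total for category, count in frequency.items()
}

for category in frequency:
    print(f"{category:<6} {frequency[category]:>3} {relative_frequency[category]:>6.1f}%")
```

The percents in a relative frequency table should add to 100 (up to rounding), which is a quick check on the arithmetic.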
two-way table
a table used to describe two categorical variables.
marginal distribution
the distribution of values of a categorical variable among all individuals described by the table.
conditional distribution
describes the values of one variable among individuals who have a specific value of another variable; there is a different conditional distribution for each value of the other variable.
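A rough sketch of how marginal and conditional distributions come out of a two-way table. The table of counts (gender by preferred pet) is hypothetical, as are all variable names.

```python
# Hypothetical two-way table of counts: rows are gender, columns are preferred pet.
table = {
    "female": {"dog": 30, "cat": 20},
    "male": {"dog": 40, "cat": 10},
}
pets = ["dog", "cat"]

# Total number of individuals described by the table.
grand_total = sum(sum(row.values()) for row in table.values())

# Marginal distribution of pet preference: column totals over the grand total.
marginal_pet = {
    pet: sum(row[pet] for row in table.values()) / grand_total for pet in pets
}

# Conditional distribution of pet preference for each gender:
# each row's counts divided by that row's total.
conditional_pet_given_gender = {
    gender: {pet: count / sum(row.values()) for pet, count in row.items()}
    for gender, row in table.items()
}

print("marginal:", marginal_pet)
for gender, dist in conditional_pet_given_gender.items():
    print(gender, dist)
```

In this made-up table the two conditional distributions differ (60% of females prefer dogs versus 80% of males), which is the kind of pattern that signals a relationship between the two variables.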
segmented bar graph
a "stacked" bar graph that shows parts of a whole; forces us to use percents, easy to compare.
association
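two variables are associated if knowing the value of one helps predict the value of the other.
for categorical variables: the conditional distributions differ.
for quantitative variables: high/low values of V1 tend to occur with high/low values of V2.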
characteristics to address when describing the distribution of a quantitative variable
shape
outliers
center
spread
shape
skewness, symmetry
center
mean, median
spread
range, IQR, standard deviation
histogram
label both axes; use equal class widths
decide where boundary values go (on the next bar or the lower bar?) and be consistent
make dot plot first
minimum of five bins (bars)
relative frequency histogram
makes it easier to compare two distributions, especially when number of individuals is very different.
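A small sketch of the binning behind a relative frequency histogram. The function name `relative_frequency_bins`, the quiz scores, and the choice to put boundary values on the next (higher) bar are all assumptions made for illustration.

```python
import math

def relative_frequency_bins(values, start, width, num_bins):
    """Return the percent of values falling in each equal-width bin.

    Bins are [start, start + width), [start + width, start + 2 * width), ...
    so a value exactly on a boundary is counted in the next (higher) bin.
    """
    counts = [0] * num_bins
    for value in values:
        index = math.floor((value - start) / width)
        if not 0 <= index < num_bins:
            raise ValueError(f"{value} falls outside the bins")
        counts[index] += 1
    return [100 * count / len(values) for count in counts]

# Hypothetical quiz scores for two classes of very different sizes.
small_class = [62, 70, 75, 80, 88]
large_class = [55, 60, 61, 68, 70, 72, 74, 75, 79, 80,
               81, 85, 86, 90, 93, 95, 97, 98, 99, 99]

# Five equal-width bins: 50-59, 60-69, 70-79, 80-89, 90-99.
small = relative_frequency_bins(small_class, start=50, width=10, num_bins=5)
large = relative_frequency_bins(large_class, start=50, width=10, num_bins=5)
print(small)
print(large)
```

Because both results are in percents, the class of 5 and the class of 20 can be compared bar by bar, which raw counts would not allow.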
x bar
mean of a sample
μ
mean of a population
resistant measures of center
median - YES, an outlier changes how extreme a value is, not how many values lie on each side of the middle
mean - NO, mean is pulled toward outliers and in the direction of skewness
how does the shape of a distribution affect the relationship between the mean and the median?
skew right: mean > median
skew left: mean < median
symmetric: mean ≈ median (approximately equal)
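A quick check of these three cases using Python's standard `statistics` module. The data sets are hypothetical, chosen only so that the shape of each is obvious.

```python
from statistics import mean, median

# Hypothetical right-skewed data: most values are small, one is very large.
right_skewed = [1, 2, 2, 3, 3, 4, 20]
# Mirror image of the same data, so the long tail is on the left.
left_skewed = [21 - x for x in right_skewed]
# Symmetric data.
symmetric = [1, 2, 3, 4, 5]

for name, data in [("skew right", right_skewed),
                   ("skew left", left_skewed),
                   ("symmetric", symmetric)]:
    print(f"{name}: mean = {mean(data)}, median = {median(data)}")
```

These relationships are tendencies rather than guarantees, so a graph of the data is still the better way to judge shape.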
range
max - min
not a resistant measure of spread
quartiles
Q1 is the median of the observations to the left of the median in the ordered list; Q3 is the median of the observations to the right.
IQR
Q3 - Q1
resistant measure of spread
outliers
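Q1 - 1.5(IQR) = lower boundary
Q3 + 1.5(IQR) = upper boundary
any observation below the lower boundary or above the upper boundary is an outlier (1.5 x IQR rule).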
five-number summary
minimum, Q1, median, Q3, maximum; these five values are the basis for a boxplot.
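The quartile, IQR, outlier, and five-number summary cards can be sketched together as below. This assumes quartiles are found as medians of the lower and upper halves of the ordered data, leaving out the median itself when the count is odd; software packages often use slightly different conventions. The `quartiles` helper and the data are hypothetical.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    The overall median is excluded from both halves when the count is odd.
    """
    if len(values) < 2:
        raise ValueError("need at least two observations")
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[-half:])

# Hypothetical data set with one unusually large value.
data = [2, 4, 5, 7, 8, 9, 10, 12, 30]

q1, q3 = quartiles(data)
iqr = q3 - q1

# 1.5 x IQR rule: boundaries beyond which an observation counts as an outlier.
lower_boundary = q1 - 1.5 * iqr
upper_boundary = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_boundary or x > upper_boundary]

five_number = (min(data), q1, median(data), q3, max(data))

print("five-number summary:", five_number)
print("IQR:", iqr)
print("boundaries:", lower_boundary, upper_boundary)
print("outliers:", outliers)
```

For this data Q1 = 4.5 and Q3 = 11, so the IQR is 6.5, the boundaries are -5.25 and 20.75, and 30 is flagged as an outlier.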
standard deviation
the typical distance of the values in the data set from the mean.
(dispersion, spread, variation)
similarities between range, IQR, standard deviation
all measure spread
differences between range, IQR, standard deviation
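range is not resistant; it uses only the two most extreme values
standard deviation is not resistant; it uses every value, so outliers inflate it
IQR is resistant; it ignores the top and bottom quarters of the data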
properties of standard deviation
measures spread about the mean; only use when mean is chosen as center.
Sx is always greater than or equal to 0; Sx = 0 only when all observations are equal.
Sx has the same measurement units as data (original observations).
Sx is NOT resistant.
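A short sketch illustrating these properties. It assumes Sx is the usual sample standard deviation, with n - 1 in the denominator, which is what `statistics.stdev` computes. The data values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample.
data = [2, 4, 4, 6, 9]

# Sample standard deviation by hand: square root of the sum of squared
# deviations from the mean, divided by n - 1.
x_bar = mean(data)
s_x = sqrt(sum((x - x_bar) ** 2 for x in data) / (len(data) - 1))
print("by hand:", s_x, " statistics.stdev:", stdev(data))

# Sx = 0 only when there is no spread at all.
constant = [5, 5, 5]
print("constant data:", stdev(constant))

# Not resistant: turning the largest value into an outlier inflates Sx.
with_outlier = [2, 4, 4, 6, 90]
print("with outlier:", stdev(with_outlier))
```

Since the deviations are squared and then square-rooted, the result comes out in the same units as the original observations.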
factors to consider when choosing summary statistics
center and spread of distribution
skewed/outlier: median, IQR (resistant)
symmetric data without outliers: mean, standard deviation
always graph the data first to check shape (e.g., a histogram)
four-part question
State the question.
Plan (set up).
Do (calculate).
Conclude (in context).