individuals
the objects described by a set of data; may be people, animals, or things
variables
any characteristic of an individual; can take different values for different individuals
categorical variable
a variable that places an individual into one of several groups or categories
quantitative variable
a variable that takes numerical values for which it makes sense to find an average
distribution
tells us what values the variable takes and how often it takes these values
frequency
the number of times a particular value for a variable has been observed
relative frequency
the ratio that compares the frequency of each category to the total frequency
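A minimal Python sketch of frequency vs. relative frequency, using hypothetical data:

```python
from collections import Counter

# hypothetical categorical data: favorite season of 10 students
data = ["fall", "summer", "fall", "winter", "fall",
        "summer", "spring", "fall", "winter", "summer"]

freq = Counter(data)                                    # frequency of each category
total = sum(freq.values())                              # total number of observations
rel_freq = {cat: n / total for cat, n in freq.items()}  # frequency / total

print(freq["fall"], rel_freq["fall"])  # 4 observations; relative frequency 0.4
```

Note that the relative frequencies always sum to 1.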
pie graphs/charts
used only when you want to emphasize each category's relation to the whole
two-way table
a way to display the frequencies of two categorical variables; one variable is represented by rows, the other by columns
marginal distribution
in a two-way table of counts, the distribution of values of one of the categorical variables among all individuals described by the table
conditional distribution
describes the values of a variable among individuals who have a specific value of another variable; basically, looking for the values of this variable that satisfy a condition of the other variable
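A sketch of marginal and conditional distributions computed from a two-way table (the counts and row/column labels are made up):

```python
# hypothetical two-way table: gender (rows) x preferred snack (columns)
table = {
    "male":   {"chips": 20, "fruit": 10, "candy": 15},
    "female": {"chips": 10, "fruit": 25, "candy": 20},
}

grand_total = sum(sum(row.values()) for row in table.values())

# marginal distribution of snack: each column total over the grand total
snacks = table["male"].keys()
marginal = {s: sum(table[g][s] for g in table) / grand_total for s in snacks}

# conditional distribution of snack given gender = "female":
# divide the "female" row by the ROW total, not the grand total
female_total = sum(table["female"].values())
conditional = {s: n / female_total for s, n in table["female"].items()}
```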
side-by-side bar graph
used to compare the distribution of a categorical variable in each of several groups; for each value of the categorical variable, there is a bar corresponding to each group . . . can be in counts or percents
segmented bar graph
displays the distribution of a categorical variable as segments of a rectangle, with the area of each segment proportional to the percent of individuals in the corresponding category
association between variables
if knowing the value of one variable helps predict the value of the other; if it doesn't then there is no association (the bar graphs would look the same)
dotplot
a graph w/ a horizontal axis and w/ dots above locations on the number line; displays quantitative variables . . . remember to label the graph
stemplot
used for fairly small data sets; show distribution by putting the final digit on the outside (leaves) and having the first digit(s) on the inside (stem) . . . remember to add a key . . . can also have a back-to-back stemplot
histogram
nearby values of quantitative data are grouped together . . . bars are side by side/connected . . . can show frequency counts or relative frequencies
"describe this distribution"
describe shape, outliers, center, spread, and include context
outliers and rule
any point that lies MORE than 1.5 IQR's from either quartile
a point is an outlier if it is greater than Q3 + 1.5(IQR) or less than Q1 − 1.5(IQR)
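A sketch of the 1.5(IQR) rule; note that `statistics.quantiles` uses one of several quartile conventions, so the fences can differ slightly from a calculator's:

```python
import statistics

def outlier_fences(data):
    """Return the (lower, upper) fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # quartiles (exclusive method)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [2, 3, 4, 5, 5, 6, 7, 8, 30]
low, high = outlier_fences(data)
outliers = [x for x in data if x < low or x > high]  # points beyond either fence
```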
skewed left/right
a non-symmetrical distribution where one tail stretches out further (to the left/right) than the other . . . if the long tail is to the right, it's skewed-right, if the long tail is to the left, it's skewed-left
"compare these distributions"
describe the shape, outliers, spread, center, of each, but use comparative words/phrases and explain how they differ from each other . . . include CONTEXT
mean vs. median (when to use)
use the mean (and SD) when you have symmetric data with NO outliers or skewness . . . use the median (and IQR) when you have heavy skewness or outliers because the median is resistant
mean of population vs. sample
use x̄ (x-bar) when you are describing the mean of a sample . . . use μ (mu) when you are describing the mean of a population (whole thing)
standard deviation of population vs. sample
use s_x when you are describing the standard deviation of a sample . . . use σ (sigma) when you are describing the standard deviation of a population (whole thing)
quartiles
values that divide a data set into four equal parts . . . the first (lower) quartile is @ the 25th percentile and is the median of the lower half of the data . . . the second quartile is @ the 50th percentile and is the median . . . the third (upper) quartile is @ the 75th percentile and is the median of the upper half of the data . . . there is no separate fourth quartile value
IQR (interquartile range)
third quartile - first quartile; the spread of the middle half (50%) of the data
5 number summary
consists of the minimum, first quartile, median (second quartile), third quartile, and maximum
range
maximum - minimum . . . is a single number . . . if the data run from 100 to 300, you cannot say "the range is 100-300"; the range is 200
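The 5-number summary, IQR, and range with hypothetical data (again, quartile conventions vary between tools):

```python
import statistics

data = [12, 15, 15, 18, 21, 24, 25, 30]  # hypothetical quantitative data

q1, med, q3 = statistics.quantiles(data, n=4)  # one common quartile convention
five_number = (min(data), q1, med, q3, max(data))
iqr = q3 - q1                        # spread of the middle 50% of the data
data_range = max(data) - min(data)   # a single number, not an interval
```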
boxplot
a graph that does not display shape very well and does not display amount of observations but does display the 5 number summary in the form of a split box with two "whiskers" . . . also called a box-and-whisker plot
variance
the standard deviation squared . . . does not use the same units as the standard deviation and original data, so can only be used to prove something mathematically . . . s_x^2
standard deviation
roughly, the typical deviation of the data from the mean . . . ex: on average, the football scores deviate/are off from the mean by about 3 points . . . the lowest the SD can be is 0 (when all points are the same) . . . measured in the same units as the original data . . . is not resistant
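A sketch of these measures with Python's `statistics` module (hypothetical scores); `stdev`/`variance` use the sample formulas (dividing by n − 1), while `pstdev` uses the population formula:

```python
import statistics

scores = [14, 17, 20, 23, 26]  # hypothetical football scores

x_bar = statistics.mean(scores)    # sample mean, x-bar
s = statistics.stdev(scores)       # sample SD, s_x (divides by n - 1)
var = statistics.variance(scores)  # sample variance, s_x^2 (squared units)
sigma = statistics.pstdev(scores)  # population SD, sigma (divides by n)
```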
resistant measure of center
the median is a resistant measure of center because it only takes into account the middle point(s), so extreme values do not pull on it
mean/SD vs. median/IQR
the mean/SD are NOT resistant (because they use every data point) and will be affected by outliers and skewness, so they should only be used to describe a distribution when the data is roughly symmetric . . . the median/IQR ARE resistant (because they only use 1-2 points) and should be used when there is heavy skewness or outliers
percentile
the value with p percent of the observations less than or equal to it . . . expressed as a percentile . . . interpreted as: "the value of ___ is at the pth percentile. about p percent of the values are less than or equal to ___."
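A minimal sketch of this "less than or equal to" percentile definition (hypothetical scores; the helper name is made up):

```python
def percentile_of(value, data):
    """Percent of observations less than or equal to the given value."""
    return 100 * sum(1 for x in data if x <= value) / len(data)

scores = [60, 65, 70, 75, 80, 85, 90, 95, 100, 100]
p = percentile_of(85, scores)  # 6 of 10 scores are <= 85: the 60th percentile
```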
z-score (standardized score)
a measure of how many standard deviations you are away from the mean (negative = below, positive = above) . . . calculated by (observation - mean)/(standard deviation)
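The z-score formula as a one-line function (hypothetical numbers):

```python
def z_score(observation, mean, sd):
    """How many standard deviations the observation lies from the mean."""
    return (observation - mean) / sd

z_above = z_score(85, 70, 10)  # 1.5 SDs above the mean
z_below = z_score(55, 70, 10)  # negative: 1.5 SDs below the mean
```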
cumulative relative frequency graph
can be used to describe the position of an individual within a distribution or to locate a specified percentile of the distribution . . . uses percentiles on the y-axis . . . steeper sections mean more observations in that interval; flatter, gradually rising sections mean fewer
recentering vs. rescaling
recentering is when you add/subtract a constant to the distribution, moving it left or right on the x-axis, NOT changing shape, spread (range and IQR), or SD . . . rescaling is when you multiply/divide by a constant, spreading the data apart or squeezing it together; it multiplies the measures of center (mean, median) and spread (SD, IQR, range) by the constant but does NOT change shape
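A quick check of these facts with made-up data: adding a constant shifts the mean but not the SD, while multiplying scales both:

```python
import statistics

data = [10, 20, 30, 40]
shifted = [x + 5 for x in data]  # recentering: add a constant
scaled = [x * 2 for x in data]   # rescaling: multiply by a constant

shift_mean = statistics.mean(shifted)  # moves by +5
shift_sd = statistics.stdev(shifted)   # unchanged
scale_mean = statistics.mean(scaled)   # doubled
scale_sd = statistics.stdev(scaled)    # doubled
```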
density curve
a mathematical curve that is always on or above the horizontal axis, has an area of 1 underneath it, and describes the overall pattern of a distribution . . . outliers are NOT described by the curve
find mean/median in density curve
when the density curve is symmetric, the mean/median are the same and are in the middle . . . when the curve is skewed-right, the mean will be closer to the tail than the median, and the median will be at the middle of the data while the mean will be @ the "balance point" . . . vice versa for skewed-left distributions
Normal distributions of data
distributions that have a bell shape and follow somewhat closely the empirical (68-95-99.7) rule . . . can be modeled by a Normal curve/model
Normal curve/model
mathematical model that describes Normal distributions . . . they have the same overall pattern: symmetric, single-peaked, bell-shaped . . . described by giving its mean and SD (a larger SD means a flatter, more spread-out curve)
68%-95%-99.7% (empirical) rule of thumb
in a Normal model, 68% of data will be between 1 SD of the mean, 95% within two SD's, and 99.7% within three SD's
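`statistics.NormalDist` can show the rule is only approximate: the exact areas are about 68.27%, 95.45%, and 99.73%:

```python
from statistics import NormalDist

model = NormalDist(mu=0, sigma=1)  # the standard Normal model

# area within 1, 2, and 3 SDs of the mean
within_1 = model.cdf(1) - model.cdf(-1)
within_2 = model.cdf(2) - model.cdf(-2)
within_3 = model.cdf(3) - model.cdf(-3)
```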
standard Normal model
the Normal model w/ mean 0 and SD 1 . . . the completely standardized Normal distribution
Normal probability plot
a display to help assess whether a distribution of data is approximately Normal; if the plot is nearly straight, the data satisfy the nearly Normal condition . . . made by finding the percentile of each observation, converting each percentile to its expected z-score, and plotting the data values on the x-axis against the expected z-scores on the y-axis
response variable
on the y-axis, measures an outcome of a study
explanatory variable
on the x-axis, may help explain or predict changes in a response variable
correlation (r)
measures the direction and strength of the LINEAR relationship between two QUANTITATIVE variables . . . a high correlation does not by itself prove the relationship is linear . . . −1 ≤ r ≤ 1, where 0 is no linear correlation and ±1 is perfect correlation . . . has NO unit of measurement . . . does NOT imply causation . . . NOT resistant . . . when x and y are flipped, the correlation r stays the same
regression line
a line that describes how a response variable y changes as an explanatory variable x changes . . . oftentimes, these lines are used to predict the value of y for a given value of x . . . ONLY used when one variable helps explain/predict the other . . . also known as line of best fit
regression line equation
ŷ = a + bx
ŷ (y hat) is the PREDICTED value of the response variable y for a given value of the explanatory variable x
b is the slope, the amount by which y is PREDICTED to change when x increases by one unit
a is the y-intercept, the PREDICTED value of y when x=0
extrapolation
the use of a regression line for prediction far outside the interval of values of the explanatory variable x used to obtain the line; these predictions are NOT accurate . . . sometimes the y-intercept is an extrapolation because x=0 wouldn't make sense or makes y negative
residuals
the difference between an observed value of the response variable and the value predicted by the regression line (vertical difference) = observed y - predicted y = y - ŷ
least squares regression line (LSRL)
the line of y on x that makes the sum of the squared residuals as small as possible . . . it's the residuals squared because if you didn't square them, when you added them together they would all cancel out . . . the mean of the least squares residuals is always 0
residual plot
a scatterplot of the residuals against the explanatory variable . . . helps to assess whether a linear model is appropriate . . . turns the regression line horizontal . . . if random scatter is on the plot, it is linear, if there is a pattern left over (such as a curve), it's not linear and the linear model is not appropriate
standard deviation of the residuals (s)
measures the typical/approximate size of the prediction errors (residuals) when using the regression line . . . written in the original units of y . . . interpreted as: "when using the LSRL w/ x=[explanatory] to PREDICT y=[response], the model will typically be off by about ____ units."
coefficient of determination (r^2)
the PERCENTAGE of the variation in the values of y that is accounted for by the LSRL of y on x . . . no units . . . 0 (the line does not predict at all) ≤ r^2 ≤ 1 (perfect) . . . is the correlation squared . . . interpreted as: "___% of the variation in [response] is accounted for/explained by the linear model on [explanatory]."
describing slope of LSRL
"This model PREDICTS that for every 1 additional [explanatory], there is an increase by ____ more [response]."
describing y-intercept of LSRL
"This model PREDICTS that [explanatory] of 0 (context) would have a [response] of ____."
outlier in regression
a point that does not follow the GENERAL TREND shown in the rest of the data AND has a LARGE RESIDUAL when the LSRL is calculated
high-leverage point
a point in regression with a substantially larger or smaller x-value than the other observations
influential point
any point in regression that, if removed, changes the relationship substantially (a much different slope, y-intercept, correlation, or r^2) . . . oftentimes, outliers and high-leverage points are influential
writing LSRL equations
ŷ = a + bx
b = correlation × (standard deviation of the y's / standard deviation of the x's) . . . b = r(s_y)/(s_x)
a = mean of the y values − slope × mean of the x values . . . a = ȳ − bx̄
LSRL always passes through point (x̄,ȳ)
regression to the mean
in a LSRL, ŷ is going to be closer to ȳ (in standard deviations) than x is to x̄, except when r = 1 or −1 . . . when x is 1 standard deviation (s_x) above x̄, the predicted ŷ is only r standard deviations (r·s_y) above ȳ
standardizing regressions
(x̄,ȳ) becomes (0,0), s_x = s_y = 1, and b=r (slope is equal to the correlation), because b= r (s_y)/(s_x), b= r (1/1) . . .
describing scatterplots
form (linear, non-linear (curved, etc.))
direction (positive, negative, none)
strength (strong, moderately-strong, moderate, moderately-weak, weak)
outliers (possible outliers, one @ (x,y), etc.)
context (Ex: actual and guessed ages . . .)