population
total set of individuals that are of interest
parameter
summarizes the population (μ-mean, and σ-standard deviation)
sample
portion OF the total population (should lack bias)
statistic
value summarizing the sample (Xbar-mean, s-standard deviation)
cases
individual items on which data is collected
respondents
people who answer a survey
subjects/participants
people who are experimented on
experimental units
objects of an experiment when not a person
graphs for categorical data
bar Chart
pie Chart
graphs for quantitative data
dot plot
histogram
box plot
histogram benefit
seeing distribution of the data
good to compare two or three groups
to describe quantitative data
Shape
Outliers (and other unusual features)
Center
Spread
shape
modality/peaks
symmetry and skewness
outliers (what it is, how to find)
data value that is far above or far below the rest of the data
upper outlier: any value above Q3 + 1.5(IQR)
lower outlier: any value below Q1 - 1.5(IQR)
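The 1.5(IQR) rule can be sketched in Python. This is a minimal sketch, assuming quartiles are found as the medians of the lower and upper halves of the sorted data; the function names and the data values are made up for illustration.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    data = sorted(values)
    half = len(data) // 2          # the middle value is left out when n is odd
    lower = data[:half]
    upper = data[-half:]
    return median(lower), median(upper)

def outlier_fences(values):
    """Return (lower fence, upper fence) using the 1.5(IQR) rule."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [2, 4, 5, 6, 7, 8, 9, 30]   # made-up values
low, high = outlier_fences(data)   # (-1.5, 14.5)
outliers = [v for v in data if v < low or v > high]
print(outliers)                    # [30]
```

Calculators and software may use slightly different quartile rules, so fences can differ a little for small data sets.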
center
median and mean
median (what it is, how to find, when to use it)
the middle of the data
order the values and find the one that is positionally the middle value
best for skewed distributions
resistant to outliers
mean (what it is, how to find, when to use it)
the average
ybar=total/n
good for roughly symmetric data
not resistant to outliers
what happens to the mean when the data is skewed
it will be further in the direction of the skewness (ex. right skewed data will have a higher mean than median)
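A quick check of how skew pulls the mean, using Python's standard `statistics` module. The data values are made up for illustration.

```python
from statistics import mean, median

# Made-up right-skewed data: one large value stretches the right tail.
right_skewed = [1, 2, 2, 3, 3, 4, 20]

m = mean(right_skewed)      # 35 / 7 = 5
med = median(right_skewed)  # middle of the ordered values = 3

print(m, med)               # the mean sits above the median
```

Dropping the 20 changes the mean a lot but barely moves the median, which is what "resistant to outliers" means.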
spread (2 main kinds)
standard deviation
IQR
standard deviation (what is it, how to find, what different sd’s mean)
typical distance of the values from the mean; shows how tightly packed the data are
s = √( ∑(y − ybar)² / (n − 1) )
small sd= data values less spread out and closer to the mean
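The formula can be written out step by step and compared with `statistics.stdev`, which also divides by n − 1. This is a sketch with made-up data; `sample_sd` is a name chosen here for illustration.

```python
from math import sqrt
from statistics import stdev

def sample_sd(values):
    """Sample standard deviation: sqrt( sum((y - ybar)^2) / (n - 1) )."""
    n = len(values)
    ybar = sum(values) / n
    return sqrt(sum((y - ybar) ** 2 for y in values) / (n - 1))

tight = [9, 10, 10, 11]   # made-up values packed close to the mean of 10
wide = [2, 8, 12, 18]     # made-up values with the same mean, more spread

print(sample_sd(tight))   # small sd
print(sample_sd(wide))    # larger sd
```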
quartiles
Q1 (median of lower half of data)= 25th percentile
Q2 (median)= 50th percentile
Q3 (median of upper half of the data)= 75th percentile
Q4 = maximum value in the data, 100th percentile
5 number summary
min, Q1, median, Q3, max (IQR and range can be computed from these)
IQR (how to find, benefits, drawbacks)
IQR=Q3-Q1
reasonable summary of distribution spread
not affected by outliers
most people don’t know what it is
center/spread combos
mean + standard deviation
both best with roughly symmetric data
based on magnitude and values
median + IQR
both best for skewed data
based on order of the data
independence
distribution of one variable is the same for all categories of another
dependent variables
have an association between the two variables
observational studies
look at sample of data to learn more about larger population
often lead to contradictory results because nothing is controlled for or really conclusive
boxplots
central box shows the middle half of the data
height of box=IQR
whiskers show skewness if they are not roughly the same length
if median is centered=roughly symmetric middle half of data
compare many groups
z-score (what is it, how to find it, what do small/large, +/- z scores mean)
how far a value is from the mean, measured in standard deviations; useful for re-expressing values on a common scale
z=(y-ybar)/s
small/large: close/far from the mean (respectively)
positive/negative: above/below the mean (respectively)
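A short sketch of standardizing a list of values with z = (y − ybar)/s. The scores are made up, and `z_scores` is a name chosen here for illustration.

```python
from statistics import mean, stdev

def z_scores(values):
    """Re-express each value as its distance from the mean in sd units."""
    ybar = mean(values)
    s = stdev(values)
    return [(y - ybar) / s for y in values]

scores = [70, 80, 90]     # made-up values: mean 80, sd 10
z = z_scores(scores)
print(z)                  # below the mean is negative, above is positive
```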
shifting data (adding or subtracting to all values) does what?
changes only the position, not the spread
mean, median, and quartiles (location) are changed
standard deviation, range, and IQR (spread) remain unchanged
rescaling the data (multiplying it by a constant) does what?
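changes both location and spread: mean, median, quartiles, standard deviation, range, and IQR are all multiplied by that constant (spread by its absolute value)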
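Shifting and rescaling can be checked numerically. This is a small sketch with made-up data; the shift of 10 and the factor of 3 are arbitrary choices.

```python
from statistics import mean, stdev

data = [2, 4, 6, 8]                  # made-up values
shifted = [y + 10 for y in data]     # add a constant to every value
scaled = [y * 3 for y in data]       # multiply every value by a constant

# Shifting moves the center but leaves the spread alone.
print(mean(data), mean(shifted))     # mean goes up by 10
print(stdev(data), stdev(shifted))   # sd is unchanged

# Rescaling multiplies both the center and the spread.
print(mean(scaled), stdev(scaled))   # both are 3 times the originals
```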
the normal model (notation, empirical rule)
N ( μ , σ )
68-95-99.7 rule (1σ away, 2σ away, 3σ away)
percentile
the percent of data that falls at or below some value
ex) the 80th percentile has 80% of the data at or below it
how to get from values (variables) to z-scores
how to get from z-score to area (%, proportion percentile)
standardize: z= (x-μ)/σ
normalcdf(lower z, upper z): gives the area between two z-scores
invNorm(percentile or area below): gives the z-score with that area below it
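The same calculator steps can be sketched with Python's `statistics.NormalDist`. The helper names `normal_area` and `z_for_percentile` are made up here to mirror the calculator functions, and the example values (x = 130, μ = 100, σ = 15) are assumptions for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def normal_area(lower_z, upper_z):
    """Area under the standard normal curve between two z-scores."""
    return std_normal.cdf(upper_z) - std_normal.cdf(lower_z)

def z_for_percentile(area_below):
    """z-score that has the given proportion of the curve below it."""
    return std_normal.inv_cdf(area_below)

# Standardize a value: z = (x - mu) / sigma  (made-up numbers)
x, mu, sigma = 130, 100, 15
z = (x - mu) / sigma                  # 2.0

# The 68-95-99.7 rule
print(round(normal_area(-1, 1), 4))   # about 0.68
print(round(normal_area(-2, 2), 4))   # about 0.95
print(round(normal_area(-3, 3), 4))   # about 0.997

# z-score for the 80th percentile
print(round(z_for_percentile(0.80), 2))
```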
scatterplots (what do they show, what are they good for)
relationship between two quantitative variables
detect patterns, trends, relationships, extraordinary values
roles for variables (what’s on y and x axis)
y axis= response variable, what we want to predict
x axis= explanatory variable, what is providing info and helping to predict response variable
correlation coefficient (r) (how to find, what it means, conditions)
on a TI-84, go to STAT→CALC→8
strength and direction of linear relationship (btw -1 and 1)
need a nearly linear relationship, quantitative variables, and no strong outliers
has no units
swapping x and y, or changing their units, does not change the correlation
correlation does not equal causation
lurking variables
correlation means a LINEAR relationship, an association is ANY relationship
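For reference, r can also be computed by hand. This is a minimal sketch with made-up data; `correlation` is a name chosen here for illustration.

```python
from math import sqrt

def correlation(xs, ys):
    """Correlation coefficient r for paired quantitative data."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]          # made-up explanatory values
ys = [2, 4, 5, 4, 5]          # made-up response values

r = correlation(xs, ys)
print(round(r, 3))            # about 0.775: positive, fairly strong

# r has no units and treats x and y the same way.
r_swapped = correlation(ys, xs)
r_new_units = correlation([x * 2.54 for x in xs], ys)
```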
residual (how to find it, what is it)
y-ŷ (ŷ= predicted value)
residual is the difference between the observed value and the predicted value
points above the line have + residuals and below the line have - residuals
line of best fit (least squares line, regression line)
(what is it, what to do with it, how its written)
line for which the sum of the squares of the residuals is the smallest
squaring the residuals makes them all positive
best fitting line will have small residuals
if you have small residuals, that means you predicted the results of your data well
ŷ = mx + b (m = slope, b = intercept)
line of best fit interpretation
for every 1-unit increase in (x), we expect (y) to increase/decrease by (slope) on average; the intercept is the predicted (y) when (x) is 0
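A sketch of fitting the least squares line and computing residuals (observed minus predicted). The data are made up, and `least_squares` is a name chosen here for illustration.

```python
def least_squares(xs, ys):
    """Return (slope, intercept) of the least squares line."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = ybar - slope * xbar   # the line passes through (xbar, ybar)
    return slope, intercept

xs = [1, 2, 3, 4, 5]                  # made-up explanatory values
ys = [2, 4, 5, 4, 5]                  # made-up response values

slope, intercept = least_squares(xs, ys)   # slope 0.6, intercept 2.2
predicted = [intercept + slope * x for x in xs]
residuals = [y - yhat for y, yhat in zip(ys, predicted)]

# Points above the line have positive residuals, points below have negative.
print([round(e, 1) for e in residuals])
```

With these numbers, each 1-unit increase in x goes with a predicted increase of about 0.6 in y.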
conditions for using regression
quantitative
straight enough
no outliers
what is r²
the fraction of the data’s variation accounted for by the model
found by squaring the correlation coefficient r
ex) r = 0.76, so r² = 0.76² ≈ 0.58
1-r² =1-0.58 = 0.42= 42%
interpreted as:
42% of the variation of y is left in the residuals (not accounted for by the model)
58% of the variation of y is accounted for by x
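One way to see why r² is the fraction of variation accounted for: squaring r gives the same number as 1 minus (variation left in the residuals) / (total variation in y). This is a sketch with made-up data.

```python
from math import sqrt

xs = [1, 2, 3, 4, 5]          # made-up explanatory values
ys = [2, 4, 5, 4, 5]          # made-up response values

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
sxx = sum((x - xbar) ** 2 for x in xs)
syy = sum((y - ybar) ** 2 for y in ys)      # total variation in y

r = sxy / sqrt(sxx * syy)                   # correlation coefficient
slope = sxy / sxx
intercept = ybar - slope * xbar

# Variation left over in the residuals after fitting the line
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

r_squared = r ** 2
r_squared_from_residuals = 1 - sse / syy

print(round(r_squared, 2), round(r_squared_from_residuals, 2))   # both 0.6
```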
extrapolation
predicting y for x-values outside the range of the data; the farther the x-value is from the mean of x, the less trust should be put in the predicted value of y
when do values have leverage
x-values that are far from the mean of the rest of the x-values
points that are extreme in y have large residuals
when are values influential
if omitting it from the analysis changes the model enough to make a meaningful difference
determined by:
residual
leverage
simple random sample
guarantees that each person (and every possible sample of the same size) has an equal chance of being selected
ensures that a non-representative sample is unlikely to occur
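A minimal sketch of drawing a simple random sample with Python's `random` module. The population of 100 student labels and the seed are made up for illustration.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n individuals without replacement; every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)

# Made-up sampling frame of 100 students
students = [f"student_{i}" for i in range(1, 101)]

sample = simple_random_sample(students, 10, seed=1)
print(sample)
```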
stratified random sampling
divides the population into HOMOGENEOUS groups (strata), then randomly selects a proportionate amount from each group
estimates will be more precise, but watch out for Simpson's paradox
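Proportionate stratified sampling can be sketched as a random draw inside each stratum. The strata names, sizes, and the 10% sampling fraction are assumptions made up for illustration.

```python
import random

def stratified_sample(strata, fraction, seed=None):
    """Randomly select the same fraction of members from every stratum."""
    rng = random.Random(seed)
    sample = {}
    for name, members in strata.items():
        k = round(len(members) * fraction)   # proportionate share for this stratum
        sample[name] = rng.sample(members, k)
    return sample

# Made-up population split into two homogeneous groups
strata = {
    "freshmen": [f"F{i}" for i in range(60)],
    "seniors": [f"S{i}" for i in range(40)],
}

picked = stratified_sample(strata, 0.10, seed=1)
print({name: len(chosen) for name, chosen in picked.items()})
# {'freshmen': 6, 'seniors': 4}
```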
cluster sampling
dividing population into smaller groups
less expensive and less time consuming
some populations are naturally broken into clusters already
clusters are HETEROGENEOUS (each one resembles the whole population)
multistage sampling
combination of several sampling methods
ex) for a college, select dorms as a cluster and then continue with other sampling methods such as a census, etc.
observational studies
researchers do not assign choices, passively observe participants
bad for cause-and-effect establishment
tough to handle lurking variables
retrospective studies
collect data on something that has already occurred
similar pros and cons as obs. studies
prospective studies
study where we identify subjects in advance and collect data as events unfold
possible to isolate variables
can be expensive and time-consuming
4 principles of experimental design
control
control what you can
randomize
randomize the rest
replicate
block
group similar individuals together and randomize within each of these blocks
helps account for variability due to the difference between blocks
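Blocking plus randomization can be sketched as shuffling within each block and then dealing out treatments in turn. The block names, unit labels, and treatments are made up for illustration.

```python
import random

def randomized_block_assignment(blocks, treatments, seed=None):
    """Randomly assign treatments separately within each block."""
    rng = random.Random(seed)
    assignment = {}
    for block_name, units in blocks.items():
        shuffled = units[:]            # copy so the original list is untouched
        rng.shuffle(shuffled)          # randomize order within this block
        for i, unit in enumerate(shuffled):
            # Deal treatments in turn so each block is split evenly
            assignment[unit] = treatments[i % len(treatments)]
    return assignment

# Made-up blocks of similar individuals
blocks = {
    "younger": ["y1", "y2", "y3", "y4"],
    "older": ["o1", "o2", "o3", "o4"],
}
treatments = ["drug", "placebo"]

assignment = randomized_block_assignment(blocks, treatments, seed=1)
print(assignment)
```

Each block ends up with the same number of units in every treatment, so differences between blocks do not get mixed up with differences between treatments.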