exponential model
y=ab^x (note that a is not the y-int and b is not the slope, they are just placeholders)
- if there is a common ratio (or approximately common) for each equal time period, you have exponential growth/decay
- common ratio > 1: growth
- 0 < common ratio < 1: decay
- make sure to note that you can't use the word exponential unless it has been proven by the data
- we usually study growth/decay over time
- x vs log y
- LSRL: log y^ = a + bx
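A minimal sketch of the x vs log y approach above, assuming base-10 logs and made-up data that follow y = 2(3^x) exactly. The helper `fit_line` and all data values are hypothetical, for illustration only.

```python
import math

def fit_line(xs, ys):
    """Least-squares line ys ~ intercept + slope * xs; returns (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data: y = 2 * 3^x for equal time steps
xs = [0, 1, 2, 3, 4]
ys = [2 * 3 ** x for x in xs]

# Common ratio check: each value divided by the one before it
ratios = [ys[i + 1] / ys[i] for i in range(len(ys) - 1)]

# Fit the LSRL to (x, log y)
log_ys = [math.log10(y) for y in ys]
intercept, slope = fit_line(xs, log_ys)

# Undo the log: y^ = 10^intercept * (10^slope)^x
initial_value = 10 ** intercept
common_ratio = 10 ** slope
```

Here the common ratio is above 1, so the fitted model describes growth. With real data the ratios would only be approximately equal.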
Statistics
the science and art of collecting, analyzing, and drawing conclusions from data
Individuals
- an object described in a set of data -- can be people, animals, or things
- WHO/WHAT are we gathering information about?
Variables
- an attribute that can take different values for different individuals
- what do we want to know about these individuals?
Qualitative/Categorical Variables
- assigns labels that place each individual into a particular group called a category
- distinct groups/classifications; can be numerical values that make no sense to average (phone numbers)
Marginal relative frequency
- gives the percent or proportion of individuals that have a specific value for one categorical variable
Conditional relative frequency
- gives the percent or proportion of individuals that have a specific value for one categorical variable among individuals who share the same value of another categorical variable (the condition)
Simpson's paradox
- an association between two variables that holds for each value of a third variable can be changed or even reversed when the data for all values of the third variable are combined
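A small sketch of how the reversal can happen, using made-up counts (the groups, treatments, and numbers are all hypothetical). Treatment A has the higher success rate inside each group, yet B looks better once the groups are combined, because A was used mostly in the group where success is rare.

```python
# Hypothetical (successes, total) counts for two treatments in two groups
counts = {
    "group 1": {"A": (8, 10), "B": (70, 100)},
    "group 2": {"A": (20, 100), "B": (1, 10)},
}

# Success rate for each treatment within each group (the third variable)
within = {
    group: {t: s / n for t, (s, n) in by_treatment.items()}
    for group, by_treatment in counts.items()
}

# Success rate for each treatment after combining the groups
combined = {}
for t in ("A", "B"):
    successes = sum(counts[g][t][0] for g in counts)
    total = sum(counts[g][t][1] for g in counts)
    combined[t] = successes / total

for group, rates in within.items():
    print(group, rates)
print("combined", combined)
```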
Side-by-side bar graph
Displays the distribution of a categorical variable for each value of another categorical variable. The bars are grouped together based on the values of one of the categorical variables and placed side by side.
Segmented bar graph
displays the distribution of a categorical variable as segments of a rectangle, with the area of each segment proportional to the percent of individuals in the corresponding category
Mosaic plot
a modified segmented bar graph in which the width of each rectangle is proportional to the number of individuals in the corresponding category
Association
- if knowing the value of one variable helps us predict the value of the other, there is association
- if knowing the value of one variable does not help us predict the value of the other, there is no association
Dot Plot
shows each data value as a dot above its location on a number line
first quartile
the median of the data values that are to the left of the median in the ordered list
cumulative relative frequency graph (ogive)
plots a point corresponding to the percentile of a given value in a distribution of quantitative data. consecutive points are then connected with a line segment to form the graph
no association
If knowing the value of one variable does not help you predict the value of the other.
cluster sampling
selects a sample by randomly choosing clusters and including each member of the selected clusters in the sample
experiment
deliberately imposes some treatment on individuals to measure their responses
random assignment
experimental units are assigned to treatments using a chance process
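One way to sketch random assignment, assuming 12 hypothetical experimental units and two treatments of equal size. The function name `randomly_assign` and the fixed seed are illustration choices, not part of the definition.

```python
import random

def randomly_assign(units, treatments, seed=None):
    """Shuffle the units by chance, then deal them out to the treatments in turn."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups

units = [f"unit{i}" for i in range(1, 13)]  # hypothetical experimental units
groups = randomly_assign(units, ["treatment", "control"], seed=1)
print(groups)
```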
Quantitative Variables
- takes number values that are quantities -- counts or measurements
- makes sense to carry out arithmetic operations like adding and averaging
Discrete Variable
- a quantitative variable that takes a fixed set of possible values with gaps between them (shoe size)
Continuous Variable
- a quantitative variable that can take any value in an interval on the number line (GPA)
Distribution
tells us what values the variable takes and how often it takes these values
Bar Graph (Bar Chart)
- shows each category as a bar
- the heights of the bars show the category frequencies or relative frequencies
- 1 categorical variable
Two-way (contingency) tables
- table of counts that summarizes data on the relationship between two categorical variables for some group of individuals
Joint relative frequency
- gives the percent or proportion of individuals that have a specific value for one categorical variable and a specific value for another categorical variable
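A short sketch comparing joint, marginal, and conditional relative frequencies on one two-way table. The table (class year vs preferred snack) and its counts are made up for illustration.

```python
# Hypothetical two-way table: rows = class year, columns = preferred snack
table = {
    "junior": {"chips": 20, "fruit": 30},
    "senior": {"chips": 40, "fruit": 10},
}
total = sum(sum(row.values()) for row in table.values())

# Marginal: seniors out of everyone (one variable only)
marginal_senior = sum(table["senior"].values()) / total

# Joint: senior AND chips out of everyone (both variables)
joint_senior_chips = table["senior"]["chips"] / total

# Conditional: chips among seniors only (the condition is "senior")
cond_chips_given_senior = table["senior"]["chips"] / sum(table["senior"].values())

print(marginal_senior, joint_senior_chips, cond_chips_given_senior)
```

The only thing that changes between the three is the denominator: the whole table for marginal and joint, and one row (or column) for conditional.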
Symmetric
- if the right side of the graph (containing the half of observations with the largest values) is approximately a mirror image of the left side
Skewed to the left
if the left side of the graph is much longer than the right side
Skewed to the right
if the right side of the graph is much longer than the left side
Stem plot
Shows each data value separated into two parts: a stem, which consists of all but the final digit, and a leaf, the final digit. The stems are ordered from lowest to highest and arranged in a vertical column. The leaves are arranged in increasing order out from the appropriate stems.
Histogram
Shows each interval of values as a bar. The heights of the bars show the frequencies or relative frequencies of values in each interval.
Mean
the arithmetic average of a distribution, obtained by adding the scores and then dividing by the number of scores
statistic
a number that describes some characteristic of a sample
parameter
a number that describes some characteristic of the population
resistant
not sensitive to extreme values
median
midpoint of a distribution, the number such that about half the observations are smaller and about half are larger
range
distance between the minimum value and the maximum value
variance
the average squared deviation from the mean, written s^2 (for a sample, the sum of squared deviations is divided by n - 1)
standard deviation
- measures the typical distance of the values in a distribution from the mean
- found by averaging the squared deviations and then taking the square root
- square root of variance
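A step-by-step sketch of the variance and standard deviation calculation, assuming a sample (so the "average" divides by n - 1, which is an assumption about which version is meant). The data values are made up.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
n = len(data)

mean = sum(data) / n
squared_deviations = [(x - mean) ** 2 for x in data]

# Sample variance: sum of squared deviations divided by n - 1
sample_variance = sum(squared_deviations) / (n - 1)

# Standard deviation: square root of the variance
sample_sd = math.sqrt(sample_variance)

# Cross-check against the standard library
print(sample_variance, statistics.variance(data))
print(sample_sd, statistics.stdev(data))
```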
quartiles
divide the ordered data set into four groups having roughly the same number of values
third quartile
the median of the data values that are to the right of the median in the ordered list
interquartile range
distance between the first and third quartiles of a distribution
outliers
individual values that fall outside the overall pattern of a distribution
five-number summary
The minimum, first quartile (Q1), median, third quartile (Q3), and the maximum
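A sketch of the five-number summary using the quartile definitions in this set (Q1 and Q3 are the medians of the values to the left and right of the median). Software often uses other quartile conventions, so results may differ slightly; the data are made up.

```python
def median(sorted_values):
    """Median of an already-sorted, non-empty list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def five_number_summary(data):
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]            # values to the left of the median
    upper = values[half + (n % 2):]  # values to the right of the median
    return (values[0], median(lower), median(values), median(upper), values[-1])

data = [1, 3, 4, 6, 7, 8, 10, 12, 15]  # hypothetical data
summary = five_number_summary(data)
iqr = summary[3] - summary[1]  # interquartile range = Q3 - Q1
print(summary, iqr)
```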
box plot
visual representation of five-number summary
modified box plot
A box plot that indicates which data values, if any, are outliers by representing them as dots separate from the box plot. The whisker(s) connect the box to the lowest and/or highest data values that are not outliers, instead of the minimum and/or maximum values.
percentile
the pth percentile of a distribution is the value with p% of observations less than or equal to it
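A quick sketch of finding the percentile of a value with the "less than or equal to" definition above. The function name `percentile_of` and the scores are hypothetical.

```python
def percentile_of(value, data):
    """Percent of observations less than or equal to value."""
    at_or_below = sum(1 for x in data if x <= value)
    return 100 * at_or_below / len(data)

scores = [55, 60, 62, 70, 71, 75, 80, 85, 90, 95]  # hypothetical test scores
p = percentile_of(75, scores)  # 6 of the 10 scores are at or below 75
print(p)
```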
standardized (z-score)
tells us how many standard deviations from the mean the value falls, and in what direction
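As a small worked example, a z-score is (value - mean) / standard deviation. The mean of 70 and standard deviation of 10 below are made up.

```python
def z_score(value, mean, sd):
    """Number of standard deviations the value lies from the mean (sign gives direction)."""
    return (value - mean) / sd

z_above = z_score(85, 70, 10)  # 1.5 standard deviations above the mean
z_below = z_score(60, 70, 10)  # 1 standard deviation below the mean
print(z_above, z_below)
```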
density curve
models the distribution of a quantitative variable with a curve that
- is always on/above the horizontal axis
- has area exactly 1 underneath it
mean of a density curve
the point at which the curve would balance if made of solid material
median of a density curve
the equal-areas point, the point that divides the area under the curve in half
normal curve
a symmetric, single-peaked, bell-shaped density curve
normal distribution
- specified by mean and standard deviation
- described by a symmetric, single-peaked, bell-shaped density curve
Empirical Rule (68-95-99.7)
In a normal distribution, about 68% of the observations are within one standard deviation of the mean, about 95% are within two standard deviations, and about 99.7% are within three standard deviations
Standard normal distribution
the normal distribution with mean 0 and standard deviation of 1
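The 68-95-99.7 figures can be checked against the standard normal distribution. This sketch uses `NormalDist` from Python's standard `statistics` module; the variable names are just for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Area under the standard normal curve between -k and +k for k = 1, 2, 3
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, area in within.items():
    print(f"within {k} SD: {area:.4f}")
```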
assess for normality method 1
1) construct a dot plot/stem plot (time-consuming), box plot (stay away from using it as support), or histogram (the default; if it is iffy, then look at a box plot)
2) see if the graph is approximately symmetrical and bell-shaped about the mean
3) mark off the points at x-bar +/- s, x-bar +/- 2s, x-bar +/- 3s. then compare the count of observations in each interval with the Empirical Rule
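Step 3 can be sketched as counting how many observations fall within 1, 2, and 3 standard deviations of the mean, then comparing with 68%, 95%, and 99.7%. The data below are made up, and `proportions_within` is a hypothetical helper.

```python
import statistics

def proportions_within(data):
    """Proportion of observations within 1, 2, and 3 standard deviations of the mean."""
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)
    n = len(data)
    return {
        k: sum(1 for x in data if abs(x - x_bar) <= k * s) / n
        for k in (1, 2, 3)
    }

# Hypothetical data set of 20 measurements
data = [61, 64, 66, 67, 68, 68, 69, 70, 70, 70,
        71, 71, 72, 72, 73, 74, 75, 76, 78, 81]
props = proportions_within(data)
print(props)  # compare with 0.68, 0.95, 0.997
```

With a small data set the proportions will rarely match the Empirical Rule exactly; the point is whether they are reasonably close.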
normal probability plot
A scatterplot of the ordered pair (data value, expected z-score) for each of the individuals in a quantitative data set. That is, the x-coordinate of each point is the actual data value and the y-coordinate is the expected z-score corresponding to the percentile of that data value in a standard Normal distribution.
assess for normality method 2
1. Construct a normal probability plot
2. Plotted points will lie close to a straight line if the distribution is close to a normal distribution
3. Outliers will appear as points that are far away from the overall pattern of the plot
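A sketch of the points that go on a normal probability plot. The percentile convention here, (i - 0.5)/n for the i-th smallest value, is an assumption; calculators and software use slightly different conventions. The data are made up.

```python
from statistics import NormalDist

def normal_probability_points(data):
    """Return (data value, expected z-score) pairs, ordered by data value."""
    values = sorted(data)
    n = len(values)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Assumed convention: the i-th smallest value sits at percentile (i - 0.5) / n
    return [
        (x, std_normal.inv_cdf((i - 0.5) / n))
        for i, x in enumerate(values, start=1)
    ]

points = normal_probability_points([12, 15, 9, 20, 14])  # hypothetical data
for x, z in points:
    print(x, round(z, 3))
```

Plotting these pairs with any graphing tool and checking for a roughly straight line is the assessment described in steps 2 and 3.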
explanatory variables
may help explain or predict changes in a response variable
independent variables
explanatory variables
response variables
measures an outcome of a study
dependent variables
response variables
positive association
when the values of one variable tend to increase as the values of the other variable increase
negative association
when the values of one variable tend to decrease as the values of the other variable increase
correlation coefficient
- r
- measures the direction and strength of the linear association between two quantitative variables
least squares regression line (LSRL)
line that models how a response variable y changes as an explanatory variable x changes
- y hat = a+bx
- line that makes the sum of the squared residuals as small as possible
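A minimal sketch of computing the LSRL y hat = a + bx and the correlation r by hand. The function name `lsrl` and the five data points are made up for illustration.

```python
import math

def lsrl(xs, ys):
    """Return (a, b, r): intercept, slope, and correlation for y hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope that minimizes the sum of squared residuals
    a = y_bar - b * x_bar    # the line passes through (x_bar, y_bar)
    r = sxy / math.sqrt(sxx * syy)
    return a, b, r

xs = [1, 2, 3, 4, 5]  # hypothetical explanatory values
ys = [2, 4, 5, 4, 5]  # hypothetical response values
a, b, r = lsrl(xs, ys)
print(a, b, r)
```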
extrapolation
Use of a regression line for prediction far outside the interval of values of the explanatory variable x used to obtain the line.
residual
the difference between the actual value of y and the value of y predicted by the regression line
scatterplots
shows the relationship between two quantitative variables measured on the same individuals. the values of one variable appear on the horizontal axis, and the values of the other variable appear on the vertical axis
intercept
predicted value of y when x = 0
slope
the amount by which the predicted value of y changes when x increases by 1 unit
residual plot
a scatterplot that displays the residuals on the vertical axis and the explanatory variable on the horizontal axis
standard deviation of residuals
s measures the size of a typical residual
- measures the typical distance between the actual y values and the predicted y values
coefficient of determination
measures the percent reduction in the sum of squared residuals when using the LSRL to make predictions, rather than the mean value of y
- measures the percent of the variability in the response variable that is accounted for by the LSRL
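A sketch tying residuals, s, and r^2 together on made-up data. The line y hat = 2.2 + 0.6x is the least-squares line for these five hypothetical points, and dividing by n - 2 when computing s is the usual convention, stated here as an assumption.

```python
import math

xs = [1, 2, 3, 4, 5]  # hypothetical explanatory values
ys = [2, 4, 5, 4, 5]  # hypothetical response values
a, b = 2.2, 0.6       # LSRL for this data: y hat = 2.2 + 0.6x

predicted = [a + b * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]  # actual - predicted

# Sum of squared residuals using the LSRL
ssr = sum(e ** 2 for e in residuals)

# Sum of squared residuals if we predicted the mean of y every time
y_bar = sum(ys) / len(ys)
sst = sum((y - y_bar) ** 2 for y in ys)

r_squared = 1 - ssr / sst              # percent reduction, as a proportion
s = math.sqrt(ssr / (len(xs) - 2))     # standard deviation of the residuals
print(residuals, r_squared, s)
```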
high leverage points in regression
have much larger/much smaller x values than the other points in the data set
outliers in regression
point that does not follow the pattern of the data and has a large residual
influential points in regression
any point that, if removed, substantially changes the slope, y-intercept, correlation, coefficient of determination, or standard deviation of the residuals
power model
y = ax^b
- if y is proportional to a power of x, we should use a power model
- log(x) vs log(y)
- LSRL: log y^ = a + b(log(x))
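A minimal sketch of the log(x) vs log(y) fit, assuming base-10 logs and made-up data that follow y = 3x^2 exactly. The helper `fit_line` and the data are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Least-squares line ys ~ intercept + slope * xs; returns (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data: y = 3 * x^2
xs = [1, 2, 4, 8, 16]
ys = [3 * x ** 2 for x in xs]

log_xs = [math.log10(x) for x in xs]
log_ys = [math.log10(y) for y in ys]
intercept, slope = fit_line(log_xs, log_ys)

# Undo the logs: y^ = 10^intercept * x^slope
coefficient = 10 ** intercept
power = slope
print(coefficient, power)
```

The slope of the log-log line is the power, and 10 raised to the intercept is the multiplier in front.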
population
the entire group of individuals we want information about
census
collects data from every individual in the population
sample
a subset of individuals in the population from which we actually collect data
convenience sampling
selects individuals from the population who are easy to reach
bias
the design of a study shows bias if it is very likely to underestimate or very likely to overestimate the value you want to know
voluntary response sampling
allows people to choose to be in the sample by responding to a general invitation
voluntary response bias
- people who self-select to participate in such surveys are usually not representative of the population of interest
- attracts people who feel strongly about an issue, and who often share the same opinion
random sampling/random selection
involves using a chance process to determine which members of a population are included in the sample
simple random sample
chosen in such a way that every group of n individuals in the population has an equal chance to be selected as the sample
strata
groups of individuals in a population who share characteristics thought to be associated with the variables being measured in a study
stratified random sampling
selects a sample by choosing an SRS from each stratum and combining the SRSs into one overall sample
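A sketch of stratified random sampling, assuming three hypothetical strata (class years) and an SRS of the same size from each. `random.Random.sample` draws without replacement, which is what makes each within-stratum draw an SRS.

```python
import random

def stratified_sample(strata, n_per_stratum, seed=None):
    """Take an SRS of n_per_stratum from each stratum and combine them."""
    rng = random.Random(seed)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

# Hypothetical population split into strata by class year
strata = {
    "sophomore": [f"so{i}" for i in range(1, 31)],
    "junior": [f"jr{i}" for i in range(1, 26)],
    "senior": [f"sr{i}" for i in range(1, 21)],
}
sample = stratified_sample(strata, 2, seed=7)
print(sample)
```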
cluster
group of individuals in the population that are located near each other
systematic random sampling
selects a sample from an ordered arrangement of the population by randomly selecting one of the first k individuals and choosing every kth individual thereafter
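A sketch of systematic random sampling, assuming a hypothetical ordered population of 100 individuals and k = 10. The function name and the seed are illustration choices.

```python
import random

def systematic_sample(ordered_population, k, seed=None):
    """Pick one of the first k individuals at random, then every k-th one after that."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # random index among the first k individuals
    return ordered_population[start::k]

population = list(range(1, 101))  # hypothetical ordered list of ID numbers
sample = systematic_sample(population, 10, seed=3)
print(sample)
```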
multistage sampling
combines two or more sampling methods
undercoverage
occurs when some members of the population are less likely to be chosen or cannot be chosen in a sample
nonresponse
occurs when an individual chosen for the sample can't be contacted or refuses to participate
wording of questions bias
confusing or leading questions can influence the answers people give
response bias
occurs when there is a systematic pattern of inaccurate answers to a survey question
observational study
observes individuals and measures variables of interest but does not attempt to influence the responses
retrospective observational studies
observational study that examines existing data for a sample of individuals
prospective observational studies
observational studies that track individuals into the future
confounding
occurs when two variables are associated in such a way that their effects on a response variable cannot be distinguished from each other