descriptive statistics
describe characteristics of sample data (e.g. mean, median, variance)
inferential statistics
extrapolate from sample data to the population of interest by giving the probability that a certain outcome will fall within a given range if randomly drawn from the population (core concept: we can infer something about the whole from a part if the part is representative)
mutually exclusive events
outcomes that cannot occur at the same time; P(A or B or C) = P(A) + P(B) + P(C)
mutually exclusive and exhaustive events
outcomes that cannot occur at the same time and that together cover all possible outcomes; P(A or B or C) = 1
histogram
shows the distribution of a variable across a set of outcomes by using bins; wider/fewer bins = boxier distribution; narrower/more bins = more idiosyncratic distribution
probability density function (PDF)
shows the probability associated with different outcomes of a variable by plotting a line; takes an average value within a given bandwidth, plots that point, then joins all the points together; increased bandwidth = smoother distribution; decreased bandwidth = more idiosyncratic distribution; the area under the curve over an interval = the probability that x will fall within that interval/range; the height of the curve comes from a density function (either a kernel density function or a normal distribution can be used to estimate it), and integrating that function over the interval gives the area
properties of a normal distribution
1) symmetrical
2) 68% of observations fall within ~1 SD of mean; 95% of observations fall within ~2 SD of mean
3) No skew/kurtosis, so only need mean and SD to plot it
4) mean, median, and mode are = to each other
5) Central Limit Theorem: provided the sample is big enough, the means of repeated random samples of a variable will form a normal distribution
random variable
a variable that takes on a set of values with specific probabilities associated with each value such that it creates a distribution
support
all the values that a variable can possibly take; for a normal distribution, the support is infinite
mean
the value we would expect on average if we drew someone at random from the sample; for a symmetric distribution it is also where you find the most observations; x bar = the sum of each individual observation (i = 1 up to n) divided by the total n
standard deviation
the square root of the variance; the average distance that a point varies from the mean; a measure of the spread of your data; SD = the square root of [the sum of (each data point - the mean)^2, divided by n - 1 for a sample (n for a population)]
variance
a measure of the spread of your data; variance = the sum of (each data point - the mean)^2, divided by n - 1 for a sample (n for a population); it is squared so that the deviations above and below the mean don't cancel one another out
standardization
changes the unit being measured to SD w/o changing relative values; without this, you can't compare variables that were measured with different units; without this, it's hard to identify how big the differences are in your data; once standardized, we know that the average deviation from the mean is 1, so anything bigger than 1 is a big deviation, and something like .1 is small
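A minimal sketch of these formulas in Python (the sample values below are made up for illustration):

```python
import numpy as np

# Made-up sample of test scores
x = np.array([62.0, 75.0, 81.0, 68.0, 90.0, 74.0])
n = len(x)

mean = x.sum() / n                              # x bar = sum of all observations / n
variance = ((x - mean) ** 2).sum() / (n - 1)    # sample variance: divide by n - 1
sd = variance ** 0.5                            # standard deviation = square root of variance

# Standardization: re-express each value in SD units (mean ~0, SD ~1)
z = (x - mean) / sd

print(mean, variance, sd)
print(z)
```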
dummy variable
has a value of 0 or 1; 1 means the observation has the characteristic in question; 0 means it does not
nominal variable
a variable with no value or ranking associated; simply named categories; e.g. religion, marital status; numbers are assigned to these categories arbitrarily for the purposes of analysis
ordinal variable
a variable with relative ranking but no numerical value; e.g. level of school completed being primary, secondary, uni; or likert scale values like poor, fair, good, excellent
cross-sectional data
data that provides a snapshot of characteristics of many observational units; there's potential for selection bias here
repeated cross-sectional data
data collected about the same characteristics but from different random samples each time; good for measuring institutional improvement rather than individual improvement
time series data
data about one characteristic observed over time; e.g. unemployment from 1990-Present
panel data/longitudinal data
data from the same observational units and mostly the same characteristics again and again over time; good for measuring individual changes over time and long-term impacts
sample
a subset of a larger population that we want to know something about; should match the population of interest as much as possible
population
the full group that we want to understand; the group from which we draw our sample
sampling error
the difference between our point estimate and the true value of the population (i.e. the population parameter); due to random sampling
point estimate
our best guess at a population parameter given our sample (often a mean or a regression coefficient); the single numerical value estimated for an unknown parameter of interest using a sample
sample parameter
a particular statistic of our sample (e.g. SD, variance)
random sub-sample
a random selection of observations from our sample
estimated parameter
a sample parameter; it is a random variable itself because it has different values depending on the sample we drew; these values have different probabilities associated with them, so with more random draws/samples, they create a (normal) distribution
Central Limit Theorem (CLT)
w/ more random draws, the distribution of the sample means will be approximately normal; the mean of the sample means will equal the population mean or something very close to it; it applies to all underlying distributions
standard error
the standard deviation of a sampling distribution; SE = SD of x / sq rt (n); SD is typically the average distance of each individual observation from the mean when we're looking at population data; with SE we are looking at the likely distance of a sample mean from the true population mean
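A small simulation sketch of the CLT and the standard error; the exponential population and the sample sizes chosen here are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, draws = 50, 10_000

# Deliberately non-normal population (exponential with mean 1 and SD 1)
samples = rng.exponential(scale=1.0, size=(draws, n))
sample_means = samples.mean(axis=1)

print(sample_means.mean())        # close to the population mean of 1.0
print(sample_means.std(ddof=1))   # SD of the sampling distribution...
print(1.0 / np.sqrt(n))           # ...is close to SE = SD / sqrt(n)
```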
sample statistic
a single numerical value calculated from individuals in sample of larger population; can be used as a proxy for population stat if sample was randomly selected and large enough
population statistic
a single numerical value calculated from observations of every individual in group of interest; often not practical to actually collect
hypothetical value of statistic
the value that we want to test against our observed value in hypothesis testing
steps for hypothesis testing
1) state a null and alternative hypothesis
2) define a significance or confidence level
3) construct a test based on significance level (e.g. confidence interval, test statistic, p-value)
4) reject or fail to reject null hypothesis
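As a rough illustration of these steps, a one-sample t-test in Python on made-up data, with an arbitrary hypothesized mean of 70:

```python
import numpy as np
from scipy import stats

# 1) H0: population mean = 70; Ha: population mean != 70 (made-up sample)
sample = np.array([72.0, 68.5, 75.0, 71.0, 69.5, 74.0, 73.5, 70.5])
alpha = 0.05                                  # 2) significance level

# 3) construct the test: test statistic and p-value
t_stat, p_value = stats.ttest_1samp(sample, popmean=70)

# 4) reject or fail to reject the null
print(t_stat, p_value)
print("reject H0" if p_value < alpha else "fail to reject H0")
```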
type I error
rejecting the null when it is in fact true (a false positive: you reject the truth); chances of this error increase when variance is bigger
type II error
failing to reject the null when it is false (a false negative: you believe the deceiver's lie); chances of this error decrease with increased sample size/power
null hypothesis
the starting point of testing a claim about a population; the benchmark against which actual outcomes can be measured; status quo, no difference, no impact; failure to reject this means that any difference or variation observed in the data is due to random chance; rejecting this means that the observed difference in the data is due to something outside of random chance (e.g. group membership, an intervention)
requirements for hypothesis testing
two mutually exclusive and exhaustive events
significance level
the highest probability that we are willing to accept of mistakenly rejecting the null hypothesis when it is true (i.e. of a false positive/a type I error); often set at the 1%, 5%, or 10% level; represented by alpha
confidence level
the probability that we correctly fail to reject the null when it is true; probability that estimated confidence interval contains the true population value; 1-alpha; set at 95% when alpha = .05; the probability of not making a Type I error
confidence interval
the range of values in which the true population mean could be that are consistent with the data in our sample; has a lower bound and an upper bound, which vary from sample to sample; estimates values between which the true parameter is likely to lie at some minimum level of probability (that we decide!); can be visualized with shaded bands or error bands; lower bound calculated as the mean - (critical value x the standard error) and upper bound calculated as mean + (critical value x the standard error)
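A sketch of a 95% confidence interval for a sample mean using a t critical value; the sample values are invented:

```python
import numpy as np
from scipy import stats

sample = np.array([5.1, 4.8, 6.0, 5.5, 4.9, 5.7, 5.3, 5.0])   # made-up data
n = len(sample)

mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(n)        # standard error = SD / sqrt(n)
crit = stats.t.ppf(0.975, df=n - 1)         # critical value for a 95% confidence level

lower = mean - crit * se                    # lower bound = mean - (critical value x SE)
upper = mean + crit * se                    # upper bound = mean + (critical value x SE)
print(lower, upper)
```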
confidence region
the area between the upper and lower bounds of the confidence intervals such that the area = confidence level
rejection region
the area outside the confidence region; the values outside of our confidence interval for which we can reject our null hypothesis; for a two-tailed test, each tail contains alpha/2; for a one-tailed test, the single tail contains alpha
alternative hypothesis
what is favored when the null is rejected; there's an impact or a true difference (due to something other than random chance) in the observed data
z-distribution
the normal distribution around our point estimate standardized with mean = 0 and SE = 1
z-score
the number of SD between a data obs and the mean; can be used when the true population SD is known; z = (x - μ) / σ
t-distribution
the non-normal distribution around our point estimate when we substitute the estimated SD for the true population SD; has slightly less area in the center of the distribution and slightly fatter tails because of the greater margin of error due to the increased uncertainty added by using a second estimated parameter (the SD); it approaches the normal distribution as the degrees of freedom increase (i.e., as the sample size increases)
one-sample test
a hypothesis test which tests whether the true population parameter equals a hypothesized value of interest; uses the point estimate from a sample based on the CLT
two-sample test
a hypothesis test which tests whether population parameters for two different variables or one population parameter from two different sub-samples are statistically significantly different from one another; in creating our hypotheses, we set both parameters equal to each other or set the difference between them equal to 0
critical z-score
the # of SEs away from the mean beyond which we can reject the null hypothesis; varies based on significance level; the smaller the significance level, the bigger the critical z-score
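Critical z-scores for common significance levels can be pulled from the normal quantile function, e.g.:

```python
from scipy import stats

# Two-tailed critical z-scores: alpha/2 in each tail
for alpha in (0.10, 0.05, 0.01):
    z_crit = stats.norm.ppf(1 - alpha / 2)
    print(alpha, round(z_crit, 3))   # roughly 1.645, 1.960, 2.576
```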
pareidolia
the tendency of humans to look for meaningful patterns even in the meaningless (e.g. hot hand fallacy); a reason that many nonprofits wrongly attribute impacts to their program
mean reversion/regression to the mean
the tendency of extreme values to move back toward the average on their own; if clients were at their worst when they entered the program, they were likely to regress/revert toward the mean anyway, even without the intervention; a reason that many nonprofits wrongly attribute impacts to their program
confirmation bias
the tendency for humans to look for information that supports their preconceived ideas about what's true; a reason that many nonprofits wrongly attribute impacts to their program
what to consider when presenting program outcomes
1) the groups we want to summarize
2) the outcomes that are important
3) the metric used
4) the statistic used
mean
the average; sum of all obs/# of obs; good for continuous variables; good for symmetrical data; a measure of central tendency
median
exact middle observation (or the average of the middle two observations); good for asymmetrical data or data with a skew/outliers; a measure of central tendency
mode
the most commonly observed outcome in the data; good for categorical variables
proportion
the mean of a dummy (1, 0) variable is the proportion of the sample that had the characteristic defined as 1
weighted mean
a mean calculated by assigning weights to each group average based on the proportion of the individuals within each group
conditional mean
a mean that is calculated only for the observations that meet certain conditions (e.g. the mean of females in the sample, or the mean of females under 25 in the sample)
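A sketch of proportions, conditional means, and a weighted mean in Python; the ages and the female dummy below are made up:

```python
import numpy as np

age    = np.array([22, 31, 27, 45, 19, 38])
female = np.array([1, 0, 1, 1, 0, 1])      # dummy: 1 = has the characteristic

proportion_female = female.mean()           # mean of a dummy = proportion with the trait

# Conditional means: only observations meeting a condition
mean_age_female = age[female == 1].mean()
mean_age_female_under_25 = age[(female == 1) & (age < 25)].mean()

# Weighted mean: group means weighted by each group's share of the sample
group_means   = np.array([age[female == 1].mean(), age[female == 0].mean()])
group_weights = np.array([(female == 1).mean(), (female == 0).mean()])
overall_mean  = (group_means * group_weights).sum()   # equals age.mean()

print(proportion_female, mean_age_female, mean_age_female_under_25, overall_mean)
```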
range
a measure of spread/variance in the data; range = highest value - lowest value; not useful if there are big outliers in the dataset
variance
a measure of spread/variance in the data; the sum of the squared deviations of each observation from the mean, divided by the number of observations (n - 1 for a sample); the standard deviation squared; it is squared to make sure that deviations above and below the mean don't just cancel one another out
standard deviation
a measure of spread/variance in the data; the square root of the variance; the average distance of each data point from the mean
formula for coefficient of variation
SD/mean
uniform distribution
the distribution in which the probability/frequency of each outcome is equal; can be created by repeating a single die roll
normal distribution
a symmetrical bell-shaped curve distribution; can be approximated by repeatedly recording the sum of several simultaneous die rolls (the more dice, the closer to normal)
binomial distribution
the distribution of the number of successes in a fixed number of independent trials that each have the same probability of success; e.g. the number of heads in repeated fair coin tosses (p = .5)
cumulative distribution function (CDF)
a function/graph displaying the percent of observations below a certain point
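A minimal empirical CDF sketch on made-up values:

```python
import numpy as np

x = np.array([3.0, 7.0, 1.0, 5.0, 9.0, 4.0])   # made-up data

def ecdf(values, point):
    """Share of observations at or below `point`."""
    return np.mean(values <= point)

print(ecdf(x, 5.0))   # ~0.67: about 67% of observations are at or below 5
```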
skewness
measures skew (the extent to which observations are clustered at one end or the other) of a distribution
right skew
positive skew; high outliers; e.g. income
left skew
negative skew; low outliers; e.g. gestational age of babies when they are born
kurtosis
measures the peakedness of a distribution and the heaviness of its tails (a normal distribution has kurtosis = 3); e.g. the average age of high school kids would likely have high kurtosis (>3); the average age of the overall population would likely have low kurtosis (<3)
formula for confidence interval
point estimate +/- margin of error
formula for margin of error
critical value x standard error
formula for standard deviation
sq. root of [sum of (x - xbar)^2, divided by n - 1]
formula for variance
sum of (x - xbar)^2, divided by n - 1
formula for standard error
SD/sq rt (n)
critical value
the dividing point between the region where the null hypothesis is rejected (rejection region) and the region where it is not rejected (confidence region)
covariance
a measure of linear association between two variables; positive values indicate a positive relationship; negative values indicate a negative relationship
formula for covariance
sum of (x - xbar)(y - ybar) across all n observations, divided by n
correlation coefficient
a statistical index of the relationship between two variables (from -1 to +1); correlation = covariance / (SDx x SDy); measures both the strength and direction of a linear relationship between two variables
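A sketch of covariance and the correlation coefficient on made-up x and y values:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])    # made up, roughly linear in x

n = len(x)
cov = ((x - x.mean()) * (y - y.mean())).sum() / n   # covariance (dividing by n)
corr = cov / (x.std() * y.std())                    # covariance / (SDx x SDy)

print(cov, corr)                  # corr near +1 for this strongly linear example
print(np.corrcoef(x, y)[0, 1])    # built-in check
```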
types of causality
causality: A-->B
reverse causality: B-->A (B causes A, not the other way around)
third variable causality: Z-->A and Z-->B
moderated causality: A-->B, with the strength of the effect depending on Z (A--Z-->B)
mediated causality: A-->Z-->B
interquartile range
a measure of the spread of a dataset; 75th percentile-25th percentile
key elements of graphical presentation
1) simple, elegant, clear
2) tells a story
3) represents the data with integrity
4) uses the right graphic for the right job
types of graphics for each statistic
means: bar charts, histograms, density plots, pie charts, lollipop graphs
variation: histograms, box plots, density plots
correlation: scatterplots (w/ regression line or smoothing line), heat maps
time series: line graphs (w/ vertical line)
motivations for using indices
1) to measure abstract constructs that are not directly observable and may be multi-faceted
2) to summarize several factors into one measure
3) to avoid statistical errors like p-hacking
straight index
sums up all the individual component variables of the index
Kling index
standardizes each component variable before summing them
Anderson index
like a Kling index, but the component variables are weighted rather than summed equally (weights come from the inverse of the covariance matrix, so highly correlated components count less)
first principal component index
finds line of best fit and the variation from that line for each component variable
minimum outcomes index
sets the index score to the lowest value among the set of component variable outcomes (e.g. you're only as healthy as your least healthy body part, so that body part's score is your overall health score)
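As a rough sketch, a Kling-style index on made-up component variables (standardize each component, then average across components):

```python
import numpy as np

# Made-up outcomes: 5 individuals (rows) x 3 component variables (columns)
components = np.array([
    [10.0, 0.2, 55.0],
    [12.0, 0.5, 60.0],
    [ 9.0, 0.1, 52.0],
    [15.0, 0.9, 70.0],
    [11.0, 0.4, 58.0],
])

# Standardize each component to mean 0 / SD 1, then average across components
z = (components - components.mean(axis=0)) / components.std(axis=0, ddof=1)
kling_index = z.mean(axis=1)   # one summary score per individual

print(kling_index)
```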
kernel density function
the estimate of the unknown PDF of a random variable based on a finite sample of data points; for each data point in the sample, a kernel function is placed centered at that point, then these individual kernel functions are summed and normalized to create a smooth curve that estimates the underlying probability distribution
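A minimal Gaussian kernel density sketch that follows this recipe on made-up data (the bandwidth here is an arbitrary choice):

```python
import numpy as np

data = np.array([1.2, 1.9, 2.3, 2.8, 3.1, 4.0])   # made-up sample
bandwidth = 0.5                                    # wider = smoother estimate

def kde(x, sample, h):
    """Center a Gaussian kernel at each data point, sum, and normalize."""
    kernels = np.exp(-0.5 * ((x - sample[:, None]) / h) ** 2) / (h * np.sqrt(2 * np.pi))
    return kernels.mean(axis=0)   # averaging normalizes so the curve integrates to 1

grid = np.linspace(0.0, 5.0, 11)
print(kde(grid, data, bandwidth))   # estimated density at each grid point
```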