Descriptive Statistics and Inferential Statistics


91 Terms

1
New cards

descriptive statistics

describe characteristics of sample data (e.g. mean, median, variance)

2
New cards

inferential statistics

extrapolate away from sample data to population of interest by giving the probability that a certain outcome will fall within a given range if randomly drawn from the population (core concept: we can infer something about the whole from a part if the part is representative)

3
New cards

mutually exclusive events

outcomes that cannot occur at the same time; for mutually exclusive events, P(A or B or C) = P(A) + P(B) + P(C)

4
New cards

mutually exclusive and exhaustive events

outcomes that cannot occur at the same time and that together cover all possible outcomes; P(A or B or C) = 1

5
New cards

histogram

shows the distribution of a variable across a set of outcomes by using bins; wider/fewer bins = boxier distribution; narrower/more bins = more idiosyncratic distribution

6
New cards

probability density function (PDF)

shows the probability associated with different outcomes of a variable by plotting a line; takes an average value within a given bandwidth, plots that point, then joins together all the points; increased bandwidth = smoother distribution; decreased bandwidth = more idiosyncratic distribution; the area under the curve = the probability that x will fall within that interval/range; to calculate the area under the curve, you integrate the density function, and the curve itself can be estimated by either a kernel density function or a fitted normal distribution

7
New cards

properties of a normal distribution

1) symmetrical

2) 68% of observations fall within ~1 SD of mean; 95% of observations fall within ~2 SD of mean

3) No skew/kurtosis, so only need mean and SD to plot it

4) mean, median, and mode are = to each other

5) Central Limit Theorem: provided the sample is big enough, the means of repeated random samples of a variable will themselves form a normal distribution
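The CLT in point 5 can be sketched in a few lines of Python (a made-up simulation: die rolls as the underlying non-normal distribution, with sample size and repetition counts chosen arbitrarily):

```python
import random
import statistics

# Each sample is 100 die rolls (a uniform, decidedly non-normal distribution);
# by the CLT the means of many such samples are approximately normal.
random.seed(42)

def sample_mean(n):
    """Mean of n independent rolls of a fair six-sided die."""
    return statistics.mean(random.randint(1, 6) for _ in range(n))

means = [sample_mean(100) for _ in range(2000)]

# The mean of the sample means should land near the population mean, 3.5,
# and their spread (the standard error) shrinks as n grows.
print(round(statistics.mean(means), 2))
print(round(statistics.stdev(means), 3))
```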

8
New cards

random variable

a variable that takes on a set of values with specific probabilities associated with each value such that it creates a distribution

9
New cards

support

all the values that could possibly be taken; for a normal distribution, it's infinite

10
New cards

mean

the value we would most expect when drawing someone at random from the sample, because (for a symmetric distribution) it's where you find the most observations; x bar = (sum of each individual x_i from i = 1 to n) / n

11
New cards

standard deviation

the square root of the variance; roughly the average distance that a point varies from the mean; a measure of the spread of your data; SD = sq rt(sum of (x_i - x bar)^2 / (n - 1)) for a sample (divide by n for a population)

12
New cards

variance

a measure of the spread of your data; variance = sum of (x_i - x bar)^2 / (n - 1) for a sample (divide by n for a population); the deviations are squared so that deviations above and below the mean don't cancel one another out
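The sample variance and SD formulas in the two cards above can be checked by hand (a made-up dataset; divisor n - 1 because it is treated as a sample):

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
xbar = sum(data) / n  # 5.0

# Sample variance: squared deviations from the mean, divided by n - 1
variance = sum((x - xbar) ** 2 for x in data) / (n - 1)  # 32/7

# Standard deviation: the square root of the variance
sd = math.sqrt(variance)

print(variance, sd)
```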

13
New cards

standardization

changes the unit being measured to SDs w/o changing relative values; without this, you can't compare variables that were measured in different units, and it's hard to tell how big the differences in your data really are; once standardized, we know the SD is 1, so anything bigger than 1 is a big deviation, and something like .1 is small
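Standardization can be sketched as converting each value to a z-score (made-up test scores; the sample SD is used as the divisor):

```python
import statistics

scores = [55, 60, 65, 70, 75, 80, 85]
xbar = statistics.mean(scores)
s = statistics.stdev(scores)  # sample standard deviation

# z-score: distance from the mean in units of SD
z = [(x - xbar) / s for x in scores]

# After standardizing, the mean is 0 and the SD is 1
print(round(statistics.mean(z), 10), round(statistics.stdev(z), 10))
```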

14
New cards

dummy variable

has a value of 0 or 1; 1 means the observation has the characteristic in question; 0 means it does not

15
New cards

nominal variable

a variable with no value or ranking associated; simply named categories; e.g. religion, marital status; numbers are assigned to these categories arbitrarily for the purposes of analysis

16
New cards

ordinal variable

a variable with relative ranking but no numerical value; e.g. level of school completed being primary, secondary, uni; or likert scale values like poor, fair, good, excellent

17
New cards

cross-sectional data

data that provides a snapshot of characteristics of many observational units; there's potential for selection bias here

18
New cards

repeated cross-sectional data

data collected about the same characteristics but from different random samples each time; good for measuring institutional improvement rather than individual improvement

19
New cards

time series data

data about one characteristic observed over time; e.g. unemployment from 1990-Present

20
New cards

panel data/longitudinal data

data from the same observational units and mostly the same characteristics again and again over time; good for measuring individual changes over time and long-term impacts

21
New cards

sample

a subset of a larger population that we want to know something about; should match the population of interest as much as possible

22
New cards

population

the full group that we want to understand; the group from which we draw our sample

23
New cards

sampling error

the difference between our point estimate and the true value of the population (i.e. the population parameter); due to random sampling

24
New cards

point estimate

our best guess at a population parameter given our sample (often a mean or a regression coefficient); the single numerical value estimated for an unknown parameter of interest using a sample

25
New cards

sample parameter

a particular stat of our sample (e.g. SD, variance)

26
New cards

random sub-sample

a random selection of observations from our sample

27
New cards

estimated parameter

a sample parameter; it is a random variable itself because it has different values depending on the sample we drew; these values have different probabilities associated with them, so with more random draws/samples, they create a (normal) distribution

28
New cards

Central Limit Theorem (CLT)

w/ more random draws, the distribution of the sample means will be approximately normal; the mean of the sample means will equal (or come very close to) the population mean; it applies regardless of the shape of the underlying distribution

29
New cards

standard error

the standard deviation of a sampling distribution; SE = SD of x / sq rt(n); SD is typically the average distance of each individual observation from the mean when we're looking at population data; with SE we are looking at the likely distance of a sample mean from the true population mean
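The SE = SD / sq rt(n) formula can be checked against a simulation (a made-up normal population; sizes chosen arbitrarily):

```python
import random
import statistics

random.seed(7)
population = [random.gauss(50, 10) for _ in range(100_000)]
n = 100

# Formula: SE = population SD / sqrt(n); here roughly 10 / 10 = 1
se_formula = statistics.pstdev(population) / n ** 0.5

# Empirical check: the SD of many sample means should be close to the formula
means = [statistics.mean(random.sample(population, n)) for _ in range(1000)]
se_empirical = statistics.stdev(means)

print(round(se_formula, 2), round(se_empirical, 2))
```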

30
New cards

sample statistic

a single numerical value calculated from individuals in sample of larger population; can be used as a proxy for population stat if sample was randomly selected and large enough

31
New cards

population statistic

a single numerical value calculated from observations of every individual in group of interest; often not practical to actually collect

32
New cards

hypothetical value of statistic

the value that we want to test against our observed value in hypothesis testing

33
New cards

steps for hypothesis testing

1) state a null and alternative hypothesis

2) define a significance or confidence level

3) construct a test based on significance level (e.g. confidence interval, test statistic, p-value)

4) reject or fail to reject null hypothesis

34
New cards

type I error

rejecting the null when it is true (i.e. a false positive); the probability of this error is capped by the chosen significance level (alpha)

35
New cards

type II error

failing to reject the null when it is false (i.e. you believe the deceiver when it was a lie); chances of this error decrease with increased sample size/power

36
New cards

null hypothesis

the starting point of testing a claim about a population; the benchmark against which actual outcomes can be measured; status quo, no difference, no impact; failure to reject this means that any difference or variation observed in the data is due to random chance; rejecting this means that the observed difference in the data is due to something outside of random chance (e.g. group membership, an intervention)

37
New cards

requirements for hypothesis testing

two mutually exclusive and exhaustive events

38
New cards

significance level

the highest probability that we are willing to accept that we mistakenly reject the null hypothesis when it is true (i.e. of a false positive/a type I error); often set at 1, 5, or 10% levels; represented by alpha

39
New cards

confidence level

the probability that we correctly fail to reject the null when it is true; probability that estimated confidence interval contains the true population value; 1-alpha; set at 95% when alpha = .05; the probability of not making a Type I error

40
New cards

confidence interval

the range of values for the true population mean that are consistent with the data in our sample; has a lower bound and an upper bound, which vary from sample to sample; estimates values between which the true parameter is likely to lie at some minimum level of probability (that we decide!); can be visualized with shaded bands or error bands; lower bound = mean - (critical value x the standard error); upper bound = mean + (critical value x the standard error)
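The lower/upper bound formulas can be sketched directly (made-up data; 1.96 is the 95% critical z, which assumes a large enough sample for the z-distribution to apply):

```python
import statistics

data = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 11.7, 12.4, 12.1]
n = len(data)
xbar = statistics.mean(data)
se = statistics.stdev(data) / n ** 0.5  # standard error

critical = 1.96  # 95% confidence level (z-distribution)
lower = xbar - critical * se
upper = xbar + critical * se

print(round(lower, 2), round(upper, 2))
```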

41
New cards

confidence region

the area between the upper and lower bounds of the confidence intervals such that the area = confidence level

42
New cards

rejection region

the area outside the confidence region; the values outside of our confidence interval for which we can reject our null hypothesis; for a two-tailed test, each tail has area alpha/2; for a one-tailed test, the single tail has area alpha

43
New cards

alternative hypothesis

what is favored when the null is rejected; there's an impact or a true difference (due to something other than random chance) in the observed data

44
New cards

z-distribution

the normal distribution around our point estimate standardized to mean = 0 and SD = 1

45
New cards

z-score

the number of SD between a data obs and the mean; can be used when the true population SD is known; z = (x - μ) / σ

46
New cards

t-distribution

the non-normal distribution around our point estimate when we substitute the estimated SD for the true population SD; has slightly less area in the center of the distribution and slightly fatter tails because of the greater margin of error from the uncertainty added by estimating a second parameter (the SD); it approaches the normal distribution as the degrees of freedom increase (i.e. as the sample size grows)

47
New cards

one-sample test

a hypothesis test which tests whether the true population parameter equals a hypothesized value of interest; uses the point estimate from a sample based on the CLT

48
New cards

two-sample test

a hypothesis test which tests whether population parameters for two different variables or one population parameter from two different sub-samples are statistically significantly different from one another; in creating our hypotheses, we set both parameters equal to each other or set the difference between them equal to 0

49
New cards

critical z-score

the # of SEs away from the mean at which we can reject the null hypothesis; varies based on significance level; a smaller significance level requires a bigger critical z-score

50
New cards

pareidolia

the tendency of humans to look for meaningful patterns even in the meaningless (e.g. hot hand fallacy); a reason that many nonprofits wrongly attribute impacts to their program

51
New cards

mean reversion/regression to the mean

the tendency for extreme observations to be followed by ones closer to the average; if clients enter the program at their worst, they were likely to regress/revert to the mean anyway, even without the intervention; a reason that many nonprofits wrongly attribute impacts to their program

52
New cards

confirmation bias

the tendency for humans to look for information that supports their preconceived ideas about what's true; a reason that many nonprofits wrongly attribute impacts to their program

53
New cards

what to consider when presenting program outcomes

1) the groups we want to summarize

2) the outcomes that are important

3) the metric used

4) the statistic used

54
New cards

mean

the average; sum of all obs/# of obs; good for continuous variables; good for symmetrical data; a measure of central tendency

55
New cards

median

exact middle observation (or the average of the middle two observations); good for asymmetrical data or data with a skew/outliers; a measure of central tendency

56
New cards

mode

the most commonly observed outcome in the data; good for categorical variables

57
New cards

proportion

the mean of a dummy (1, 0) variable is the proportion of the sample that had the characteristic defined as 1

58
New cards

weighted mean

a mean calculated by assigning weights to each group average based on the proportion of the individuals within each group
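A weighted mean can be sketched with made-up group sizes and group means:

```python
# (group size n, group mean) for three hypothetical groups
groups = {"A": (30, 70.0), "B": (50, 60.0), "C": (20, 80.0)}

total_n = sum(n for n, _ in groups.values())
# Each group mean is weighted by its share of the sample
weighted_mean = sum(n * m for n, m in groups.values()) / total_n

print(weighted_mean)  # (30*70 + 50*60 + 20*80) / 100 = 67.0
```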

59
New cards

conditional mean

a mean that is calculated only for the observations that meet certain conditions (e.g. the mean of females in the sample, or the mean of females under 25 in the sample)

60
New cards

range

a measure of spread/variance in the data; R=H-L; not useful if there are big outliers in the dataset

61
New cards

variance

a measure of spread in the data; the sum of the squared deviations of every observation from the mean, divided by n - 1 (sample) or n (population); the standard deviation squared; deviations are squared so that those above and below the mean don't cancel one another out

62
New cards

standard deviation

a measure of spread/variance in the data; the square root of the variance; the average distance of each data point from the mean

63
New cards

formula for coefficient of variation

SD/mean

64
New cards

uniform distribution

the distribution in which the probability/frequency of each outcome is equal; can be created by repeating a single die roll

65
New cards

normal distribution

a symmetrical bell-shaped curve distribution; can be approximated by repeatedly summing several simultaneous die rolls (the more dice per roll, the closer to normal)

66
New cards

binomial distribution

the distribution of the number of successes in n independent trials that each have the same probability of success; repeated tosses of a fair coin (p = .5) are the classic example
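A binomial distribution can be simulated with repeated coin flips (made-up parameters: 10 flips per trial, p = .5):

```python
import random
import statistics

random.seed(1)

def heads_in_flips(n=10, p=0.5):
    """Number of successes (heads) in n independent flips."""
    return sum(random.random() < p for _ in range(n))

counts = [heads_in_flips() for _ in range(5000)]

# The average count of heads should be near n * p = 5
print(round(statistics.mean(counts), 2))
```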

67
New cards

cumulative distribution function (CDF)

a function/graph displaying the percent of observations below a certain point

68
New cards

skewness

measures skew (the extent to which observations are clustered at one end or the other) of a distribution

69
New cards

right skew

positive skew; high outliers; e.g. income

70
New cards

left skew

negative skew; low outliers; e.g. gestational age of babies when they are born

71
New cards

kurtosis

measures how peaked/heavy-tailed a distribution is relative to the normal (which has kurtosis = 3); e.g. the average age of a high school kid would likely have high kurtosis (>3); the average age of the overall population would likely have low kurtosis (<3)

72
New cards

formula for confidence interval

point estimate +/- margin of error

73
New cards

formula for margin of error

CV*SE

74
New cards

formula for standard deviation

sq rt(sum of (x - x bar)^2 / (n - 1))

75
New cards

formula for variance

sum of (x - x bar)^2 / (n - 1)

76
New cards

formula for standard error

SD/sq rt (n)

77
New cards

critical value

the dividing point between the region where the null hypothesis is rejected (rejection region) and the region where it is not rejected (confidence region)

78
New cards

covariance

a measure of linear association between two variables; positive values indicate a positive relationship; negative values indicate a negative relationship

79
New cards

formula for covariance

sum of (x_i - x bar)(y_i - y bar) for each individual i up to n, divided by n (use n - 1 for a sample)

80
New cards

correlation coefficient

a statistical index of the relationship between two variables (from -1 to +1); the covariance/(SDx)(SDy); measures both the strength and direction of a linear relationship between two variables
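The covariance and correlation formulas from the two cards above can be sketched on a small made-up dataset (population versions, dividing by n):

```python
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Covariance: average product of paired deviations from the means
cov = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / n

# Correlation: covariance scaled by both SDs, so it falls in [-1, +1]
sdx = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / n)
sdy = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / n)
r = cov / (sdx * sdy)

print(round(r, 3))  # positive: x and y tend to rise together
```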

81
New cards

types of causality

causality: A --> B

reverse causality: B --> A

third variable causality: Z --> A and Z --> B

moderated causality: Z changes the strength of A --> B

mediated causality: A --> Z --> B

82
New cards

interquartile range

a measure of the spread of a dataset; 75th percentile-25th percentile

83
New cards

key elements of graphical presentation

1) simple, elegant, clear

2) tells a story

3) represents the data with integrity

4) uses the right graphic for the right job

84
New cards

types of graphics for each statistic

means: bar charts, histograms, density plots, pie charts, lollipop graphs

variation: histograms, box plots, density plots

correlation: scatterplots (w/ regression line or smoothing line), heat maps

time series: line graphs (w/ vertical line)

85
New cards

motivations for using indices

1) to measure abstract constructs that are not directly observable and may be multi-faceted

2) to summarize several factors into one measure

3) to avoid statistical errors like p-hacking

86
New cards

straight index

sums up all the individual component variables of the index

87
New cards

Kling index

standardizes each component variable before summing them

88
New cards

Anderson index

like a Kling index, but w/ variable weighting

89
New cards

first principal component index

finds the line of best fit through the component variables (the direction of greatest variance) and scores each observation by its position along that line

90
New cards

minimum outcomes index

sets the index score to the lowest value among the set of component variable outcomes (e.g. you're only as healthy as your least healthy body part, so that body part's score is your overall health score)
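The straight and Kling indices above can be sketched side by side (made-up component data; rows are individuals, columns are component variables):

```python
import statistics

rows = [
    [3.0, 10.0, 0.2],
    [4.0, 12.0, 0.5],
    [5.0,  8.0, 0.9],
    [2.0, 14.0, 0.4],
]

# Straight index: sum the raw components (units get mixed together)
straight = [sum(r) for r in rows]

# Kling index: standardize each component first, then sum the z-scores
cols = list(zip(*rows))
means = [statistics.mean(c) for c in cols]
sds = [statistics.stdev(c) for c in cols]
kling = [sum((v - m) / s for v, m, s in zip(r, means, sds)) for r in rows]

print(straight)
print([round(k, 2) for k in kling])
```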

91
New cards

kernel density function

the estimate of the unknown PDF of a random variable based on a finite sample of data points; for each data point in the sample, a kernel function is placed centered at that point, then these individual kernel functions are summed and normalized to create a smooth curve that estimates the underlying probability distribution
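The kernel density estimate described here can be written out by hand (made-up data points; a Gaussian kernel and a hand-picked bandwidth):

```python
import math

data = [1.0, 1.5, 2.0, 2.2, 3.5, 4.0]
h = 0.5  # bandwidth: bigger = smoother estimate

def gaussian_kernel(u):
    """Standard normal density, used as the kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x):
    """Estimated density at x: normalized sum of kernels centered at each data point."""
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (len(data) * h)

# The estimate is higher near the cluster of points around 2 than out at 5
print(round(kde(2.0), 3), round(kde(5.0), 3))
```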