survivorship bias
only having data on individuals that lived through an incident
random sampling qualifications
each individual has an equal chance of being selected
selection is independent between individuals
precision
minimizing sampling error
accuracy
minimizing bias
categorical variables
qualitative groupings with no inherent magnitude on a numerical scale
nominal variables
categorical variables with no obvious order
ordinal variables
categorical variables that have an intrinsic order
numeric (quantitative) variables
have a magnitude on a numerical scale
continuous variables
numeric variables for data containing any real number within some bounds
discrete variables
numeric variables for data containing only whole numbers
ratio vs interval scale
ratio scales have a true zero point representing the absence of the variable, while interval scales do not.
frequency distribution
the number of times each value occurs in a sample
histograms
graphical displays of a frequency distribution, with bar height showing the count in each interval
Descriptive Statistics
quantities that capture features of the sample data
arithmetic mean
regular old averaging
very impacted by outliers
median
middlemost measurement
somewhat resistant to outliers
geometric mean
used to summarize data when variables are multiplicative
standard deviation
square root of variance
s = √(s²)
variance
spread of data
s² = Σ(observation − mean)² / (n − 1)
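The variance and standard-deviation formulas above can be checked with a short pure-Python sketch (the data values are made up for illustration):

```python
import math

def sample_variance(xs):
    # s^2 = sum((observation - mean)^2) / (n - 1)
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_sd(xs):
    # s = sqrt(s^2)
    return math.sqrt(sample_variance(xs))

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative values
print(sample_variance(data))  # 32/7 ≈ 4.571
print(sample_sd(data))
```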
coefficient of variation
used for ratio scale variables (spread is expected to increase with mean)
CV=s/sample mean
IQR
range with middle 50% of the data
useful for skewed data
sampling distribution
distribution of values for an estimate that we might obtain with repeated sampling of a population
standard error
standard deviation of sampling distribution
SE(mean) = s/sqrt(n)
Confidence Interval
range of values that are likely to contain the target parameter
sample mean ± 1.96 * SE(mean) for normally distributed data
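The standard-error and confidence-interval cards above combine into one small sketch (data values are illustrative):

```python
import math
import statistics

def mean_ci_95(xs):
    # sample mean ± 1.96 * SE(mean), where SE(mean) = s / sqrt(n)
    m = statistics.mean(xs)
    se = statistics.stdev(xs) / math.sqrt(len(xs))
    return (m - 1.96 * se, m + 1.96 * se)

low, high = mean_ci_95([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(round(low, 3), round(high, 3))
```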
Probability
proportion of times event occurs if repeating random trial many times
value between 0 and 1, where probabilities of all possibilities must sum to 1
Probability Mass Function
probability of all possibilities for discrete data
Probability Density Function
probability of all possibilities for continuous data
Marginal Probability
the probability of a single event occurring, independent of other variables
ex. P(A)
Conditional Probability
measures the likelihood of an event occurring given that another event has already occurred
ex. P(A|B)
Union Probability
measures the likelihood that at least one of multiple events occurs
ex. P(A u B)
Complement Probability
calculates the likelihood of an event not occurring
ex. P(Ac)
Intersection/ Joint Probability
the likelihood of two or more independent or dependent events occurring simultaneously
ex. P(A,B)
P(A and B)
P(A) × P(B|A); reduces to P(A) × P(B) only when A and B are independent
P(A or B)
P(A) + P(B) - P(A,B)
Bayes' Theorem
Bayes' Theorem helps us update probabilities based on prior knowledge and new evidence
P(A|B) = (P(B|A) × P(A)) / P(B)
P(A given B)
P(A,B)/P(B)
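Bayes' theorem and the law of total probability can be illustrated with a classic screening-test calculation (all numbers here are hypothetical, chosen only to show the mechanics):

```python
def bayes(p_b_given_a, p_a, p_b):
    # P(A|B) = P(B|A) * P(A) / P(B)
    return p_b_given_a * p_a / p_b

# hypothetical numbers: 1% base rate, 99% true-positive rate,
# 5% false-positive rate
p_a = 0.01
p_b_given_a = 0.99
# law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_b = p_b_given_a * p_a + 0.05 * (1 - p_a)
posterior = bayes(p_b_given_a, p_a, p_b)
print(round(posterior, 3))  # ≈ 0.167: most positives are false positives
```

Even with an accurate test, a low base rate keeps the posterior probability modest.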
probability distribution
the mathematical function that gives the probabilities of occurrence of possible outcomes
Used to: fit models to data, represent uncertainty in parameters, and portray prior information
Normal Distribution
continuous
symmetrical around its mean
unimodal
probability density is highest at the mean
normal distribution ranges
x (-inf, +inf)
u (-inf, +inf)
sigma > 0
Z-score
test statistic for a normal distribution
(X-u)/sigma
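The Z-score formula above is one line of code (the mean-100, sd-15 scale is just an illustrative choice):

```python
def z_score(x, mu, sigma):
    # how many standard deviations x lies from the mean
    return (x - mu) / sigma

# e.g., a score of 130 on an IQ-style scale with mean 100 and sd 15
print(z_score(130, 100, 15))  # 2.0
```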
log normal distribution
continuous
positive only
positive right skew
Log normal distribution ranges
x (0, +inf)
u (-inf, +inf)
sigma > 0
central limit theorem
the sum or mean of measurements randomly sampled from ANY distribution is approximately normally distributed (for sufficiently large samples)
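A quick simulation shows the theorem in action: the population below is exponential (strongly right-skewed), yet its sample means behave like a normal distribution centered on the population mean (sample sizes here are arbitrary):

```python
import random
import statistics

random.seed(0)

# exponential population with mean 1.0 -- very non-normal --
# but means of repeated samples of size 50 cluster around 1.0
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(50))
    for _ in range(2000)
]
print(round(statistics.mean(sample_means), 2))   # close to 1.0
print(round(statistics.stdev(sample_means), 2))  # close to 1/sqrt(50) ≈ 0.14
```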
location moment
mean
spread moment
variance
symmetry moment
skewness
heavy tailed-ness moment
kurtosis
t-distribution ranges
x (-inf, +inf)
u (-inf, +inf)
sigma > 0
v (nu) > 0
poisson distribution
discrete
positive only
only one parameter (lambda)
normal is a good approximator when lambda is large
often used for counts
poisson distribution ranges
x (whole numbers ≥ 0)
lambda > 0
lambda
λ = mean = variance (s²)
underdispersed population
variance < mean
overdispersed
variance > mean
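The mean-equals-variance property of the Poisson can be checked by simulation. Python's standard library has no Poisson sampler, so this sketch uses Knuth's classic algorithm (an implementation choice, not something from the cards):

```python
import math
import random
import statistics

def poisson_sample(lam, rng):
    # Knuth's algorithm: count events until the running product
    # of uniform draws falls below e^(-lambda)
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

rng = random.Random(42)
draws = [poisson_sample(4.0, rng) for _ in range(20000)]
print(round(statistics.mean(draws), 1))      # ≈ lambda = 4
print(round(statistics.variance(draws), 1))  # ≈ lambda as well
```

Real count data whose variance differs markedly from the mean would be under- or overdispersed relative to this ideal.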
binomial distribution
discrete
positive only
used for success/fail trials
binomial distribution ranges
x (# of successes) (whole numbers from 0 to n)
n (# of trials) (whole numbers > 0)
p (probability of a success) [0,1]
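The binomial probability mass function follows directly from these three parameters (the coin-flip numbers are illustrative):

```python
from math import comb

def binom_pmf(k, n, p):
    # P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# probability of exactly 2 heads in 4 fair coin flips
print(binom_pmf(2, 4, 0.5))  # 0.375

# as a PMF must, it sums to 1 over all possible success counts
print(sum(binom_pmf(k, 4, 0.5) for k in range(5)))  # 1.0
```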
bernoulli distribution
a special case of binomial distribution where n (number of trials) =1
multinomial distribution
generalization of the binomial, when there are >2 categories
ex. dice
gamma distribution
continuous
positive only
flexible
multiple common parameterizations
gamma distribution ranges
x [0, +inf)
alpha (shape parameter) > 0
theta (scale parameter) > 0
Beta distribution
continuous
bound between 0 and 1
used for proportions
Beta distribution ranges
x (0,1)
alpha >0
Beta > 0
Directed Acyclic Graphs (DAGs)
can use these to denote causal relationships between variables
DAG features
node
edge
direction
a/cyclic
confounding variables
influences both the dependent and independent variables, causing a spurious correlation
Fork example
sun intensity, sunburns, ice cream sales
pipe example
day of the year, temperature, how fast ice cream melts
the collider
ice cream sales, quality of ice cream, outside temperature
hypothesis testing
compares collected data to expectations under a null hypothesis to determine how unlikely the data are
p-value
probability of obtaining a value that is as or more extreme than the observed value, given the null hypothesis is true
test statistic
value calculated from the data that is used to evaluate how unlikely your data are, given the null hypothesis is true.
p-value < α
reject the null hypothesis
α
probability of committing a type one error (false positive). Generally 0.05
β
probability of committing a type two error (false negative)
p-value > α
The data are compatible or consistent with the null hypothesis
Confidence intervals
range of values that are likely to contain the target parameter
Permutation tests
generates a null distribution of the test statistic by repeatedly rearranging values
slightly less power than parametric tests (like t-tests) when sample sizes are small
performing a permutation test
calculate test statistic
randomly rearrange data into new groups
calculate test statistic for permuted data
repeat 1000+ times to generate sampling distribution of test statistic under the null hypothesis
calculate p-value
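The five steps above can be sketched in pure Python for a difference in group means (group values are invented for illustration):

```python
import random
import statistics

def permutation_test(group_a, group_b, n_perm=2000, seed=0):
    rng = random.Random(seed)
    observed = statistics.mean(group_a) - statistics.mean(group_b)  # step 1
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):                      # step 4: repeat many times
        rng.shuffle(pooled)                      # step 2: rearrange groups
        diff = (statistics.mean(pooled[:n_a])
                - statistics.mean(pooled[n_a:])) # step 3: permuted statistic
        if abs(diff) >= abs(observed):           # "as or more extreme"
            extreme += 1
    return extreme / n_perm                      # step 5: the p-value

a = [5.1, 4.8, 6.0, 5.5, 5.9]  # illustrative measurements
b = [4.2, 4.5, 4.1, 4.8, 4.4]
p = permutation_test(a, b)
print(p)  # small: groups differ more than chance rearrangement predicts
```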
Bootstrap
resampling data with replacement to approximate the sampling distribution of an estimate
useful for finding standard error or confidence interval for a parameter estimate
performing bootstrapping
sample with replacement from original sample
calculate the estimate (e.g., the median) from the bootstrapped data
repeat 10000+ times
calculate SE
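The bootstrap steps above, applied to the median, look like this in stdlib Python (the data and resample count are illustrative):

```python
import random
import statistics

def bootstrap_se_median(sample, n_boot=5000, seed=1):
    rng = random.Random(seed)
    medians = []
    for _ in range(n_boot):
        # sample with replacement, same size as the original sample
        resample = [rng.choice(sample) for _ in range(len(sample))]
        medians.append(statistics.median(resample))
    # SE = standard deviation of the bootstrap sampling distribution
    return statistics.stdev(medians)

data = [3.1, 4.7, 5.0, 5.2, 6.8, 7.4, 8.0, 9.9, 10.3, 12.1]  # illustrative
se = bootstrap_se_median(data)
print(round(se, 2))
```

A percentile confidence interval would come from the same `medians` list, e.g. its 2.5th and 97.5th percentiles.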
ways to reduce bias
control group - does not receive treatment but is exposed to the same conditions
randomization - random assignment of treatments
blinding
single blinding
hides treatment details from participants to prevent behavioral bias
double blinding
hides treatment details from participants and researchers to prevent expectancy effects
ways to reduce sampling error
replication - application of each treatment to multiple, independent units OR application of multiple, identical treatments to a single unit
balance - equal sample size in all groups
blocking - grouping of units that share properties (like location); within each block, treatments are randomly assigned
statistical power
probability that a random sample will lead to rejection of a false null hypothesis
1 - beta
Type 1 error
rejecting the null hypothesis when the null hypothesis is true
False +
alpha
Type 2 error
failing to reject the null hypothesis when the null hypothesis is false
False -
Beta
Statistical Power is most affected by
Sample Size and Variance of Data
chi squared goodness of fit test
compares frequency data to a probability model stated by the null hypothesis
common for hypothesis testing
chi squared test statistic (χ²)
Σ((observed − EV)² / EV)
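The χ² statistic is a direct sum over categories (the die-roll counts are invented for illustration):

```python
def chi2_stat(observed, expected):
    # sum over categories of (observed - expected)^2 / expected
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# 60 die rolls vs. a fair-die null model (expected 10 per face)
observed = [8, 12, 10, 10, 10, 10]
expected = [10.0] * 6
print(chi2_stat(observed, expected))  # 0.8
```

With k = 6 − 1 = 5 degrees of freedom, a value this small is entirely consistent with a fair die.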
chi squared distribution
only affected by k (degrees of freedom)
k = df
k
# categories - 1 - # parameters estimated from the data
chi squared assumptions
random and independent sampling
expected frequencies >1
no more than 20% of categories should have expected frequencies < 5
one sample t-test
compares the mean of a sample with some ānull meanā
t-distribution
continuous and symmetrical. ν (nu) controls āheavy tailed-nessā
t-distribution test statistic
(sample mean - null mean) / SE of the sample mean
T-test df
n-1 = v
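The one-sample t statistic and its degrees of freedom reduce to a few lines (the measurements and null mean are made up):

```python
import math
import statistics

def one_sample_t(sample, null_mean):
    # t = (sample mean - null mean) / SE(mean), with df = n - 1
    n = len(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.mean(sample) - null_mean) / se, n - 1

# do these made-up measurements differ from a null mean of 10.0?
t, df = one_sample_t([9.8, 10.2, 10.1, 9.9, 10.4, 10.0], 10.0)
print(round(t, 3), df)  # small t: data are compatible with the null
```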
Paired t-test
compares the mean difference between two sample means to a null mean
controls for variation among plots that is otherwise difficult to control for
paired t-test test statistic
(mean difference - null mean)/SE of the mean difference
Unpaired t-test
compares the difference of one sample mean from another sample mean
Unpaired t-test test statistic
(sample 1 mean - sample 2 mean)/SE of sample 1 mean - sample 2 mean
Unpaired t-test df
v = n1 + n2 - 2