t-statistic represents
how many standard errors the estimate is from zero
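As a rough illustration of the definition above, the t-statistic can be computed by dividing the estimate by its standard error. The coefficient and standard error values below are made up for the example, and the function name is hypothetical.

```python
def t_statistic(estimate, standard_error, null_value=0.0):
    """Number of standard errors between the estimate and the null value."""
    return (estimate - null_value) / standard_error


# Hypothetical regression coefficient of 1.5 with a standard error of 0.5
t = t_statistic(1.5, 0.5)
print(t)  # the estimate is 3 standard errors away from zero
```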
omitted variable bias
arises when a relevant variable is left out
a correctly specified model should reflect a plausible set of assumptions and should be complete
very hard to detect OVB
A model with a higher R² means
the model explains more of the variance in the dependent variable
data mining
selectively reporting significant results
scouring through datasets for (partial) correlations from which to develop a thesis
how can outliers and violations of normality affect statistical inference
outliers and non-normal data can distort estimates and invalidate hypothesis tests. Check residual plots, use transformations, or robust methods
consider a regression output where R² = 0.85 and TSS = 1000, what is the ESS?
ESS = R² x TSS = 0.85 × 1000 = 850
three main types of distributional tests
seeing where a value fits within a distribution (determines how unusual or typical a single value is compared to the rest of the data)
making a certainty estimate from a sample to the general population (estimating population parameters from a sample, larger the sample the better our estimates)
comparing two sample means (involves testing whether two groups are significantly different from each other)
z-score
number of standard deviations away from the mean
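A minimal sketch of a z-score calculation, assuming the data are treated as a full population (so the population standard deviation is used). The scores are invented for illustration.

```python
from statistics import mean, pstdev


def z_score(value, data):
    """Standard deviations between `value` and the mean of `data`."""
    return (value - mean(data)) / pstdev(data)


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population standard deviation 2
z = z_score(9, scores)
print(z)  # a score of 9 sits two standard deviations above the mean
```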
confidence intervals
error bars; for example, a 95% confidence interval is a range of values constructed so that, across repeated samples, about 95% of such intervals contain the measure you are interested in
to check whether two proportions are statistically significantly different, find both confidence intervals and check whether they overlap (no overlap indicates a significant difference)
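The overlap check can be sketched as follows. This assumes the simple normal-approximation interval for a proportion and z = 1.96 for 95% confidence; the counts are invented and the function names are hypothetical. Note that the overlap check is conservative: intervals that do not overlap indicate a significant difference, but slightly overlapping intervals do not rule one out.

```python
import math


def proportion_ci(successes, n, z=1.96):
    """Normal-approximation confidence interval for a sample proportion."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se


def intervals_overlap(a, b):
    """True if the two (low, high) intervals share any values."""
    return a[0] <= b[1] and b[0] <= a[1]


ci_a = proportion_ci(120, 200)  # sample proportion 0.60
ci_b = proportion_ci(80, 200)   # sample proportion 0.40
print(ci_a, ci_b, intervals_overlap(ci_a, ci_b))
```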
standard error formula
standard deviation of the sample mean: SE = s / √n (sample standard deviation divided by the square root of the sample size)
what question do we answer with a t test
is the difference between the two statistically significant?
what are all the explanations of correlation between A and B?
A causes B (possibly through C)
B causes A (possibly through C)
C causes both A and B
C causes B, and A is only spuriously correlated
What other misspecifications could there be to a set of data?
omitted variable bias
included variable bias
normality and outliers
data mining
interaction effects
causal endogeneity
ecological fallacy
included variable bias
the addition of multiple (usually irrelevant) variables to obtain the desired result
normality and outliers
OLS inference works best when residuals follow a (close to) normal distribution - skewed data can still be transformed
make sure you check for outliers and deal with them (investigate, transform, or remove)
How is the margin of error affected by sample size
The margin of error decreases as the sample size increases.
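A small sketch of why this happens, assuming the usual margin of error for a mean, z × s / √n, with z = 1.96 for 95% confidence. The standard deviation of 10 is an arbitrary example value.

```python
import math


def margin_of_error(std_dev, n, z=1.96):
    """Margin of error for a sample mean: z * s / sqrt(n)."""
    return z * std_dev / math.sqrt(n)


moe_100 = margin_of_error(10, 100)
moe_400 = margin_of_error(10, 400)
print(moe_100, moe_400)  # quadrupling the sample size halves the margin
```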
inferential statistics
using sample data to make inferences about a larger population
distributional tests
statistical tools that help determine how data points relate to theoretical probability distributions
OLS (ordinary least squares) multivariate analysis
statistical technique used to estimate the relationship between one dependent variable and two or more independent variables, extension of a simple linear regression
R² formula
1 - (residual sum of squares / total sum of squares)
TSS (total sum of squares)
measures the total variation in the dependent variable
quantifies how spread out the observed data are before considering any model
ESS (explained sum of squares)
measures the variation in the dependent variable that is explained by the model; used to describe how well a model, often a regression model, represents the data being modelled
RSS (residual sum of squares)
measures the unexplained variation
sum of the squared differences between each point and the line of best fit
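The three sums of squares and R² can be tied together with a hand-rolled simple regression. This is only a sketch on made-up data, using the standard one-predictor OLS formulas; for OLS with an intercept, TSS = ESS + RSS.

```python
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

# One-predictor OLS: slope = Sxy / Sxx, intercept = y_bar - slope * x_bar
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * xi for xi in x]

tss = sum((yi - y_bar) ** 2 for yi in y)               # total variation
ess = sum((fi - y_bar) ** 2 for fi in fitted)          # explained variation
rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) # unexplained variation

r_squared = 1 - rss / tss
print(tss, ess, rss, r_squared)
```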
level-level
x increases by one unit, y increases by the coefficient of x (units)
log-level
x increases by one unit, y increases by the coefficient x 100 (%)
level-log
x increases by 1%, y increases by coefficient / 100 (units)
log-log
x increases by 1%, y increases by coefficient (%)
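The four functional forms can be summarised in a small lookup helper. This is a sketch using the usual approximate interpretations; the function name and the example coefficients are hypothetical. In the level-log case the effect is measured in units of y, not in percent.

```python
def interpret(form, coef):
    """Return (approximate effect size, description) for a coefficient."""
    if form == "level-level":
        return coef, "units of y per one-unit increase in x"
    if form == "log-level":
        return coef * 100, "% change in y per one-unit increase in x"
    if form == "level-log":
        return coef / 100, "units of y per 1% increase in x"
    if form == "log-log":
        return coef, "% change in y per 1% increase in x"
    raise ValueError(f"unknown functional form: {form}")


for form, coef in [("log-level", 0.05), ("level-log", 200), ("log-log", 0.8)]:
    effect, meaning = interpret(form, coef)
    print(form, round(effect, 4), meaning)
```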
Regression process
descriptive statistics, regression assumptions, run reg analysis y var first, check statistical significance, interpret, repeat, choose correct model
5 OLS Assumptions
no multicollinearity, no autocorrelation, linear, homoskedasticity, normal distribution
No multicollinearity
no two independent variables are highly correlated; check that VIF < 5
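For intuition, here is a sketch of the VIF in the special case of exactly two predictors, where it reduces to 1 / (1 − r²) with r the correlation between them. With more predictors the r² comes from regressing each predictor on all the others. The data are invented.

```python
import math


def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    a_bar = sum(a) / len(a)
    b_bar = sum(b) / len(b)
    s_ab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    s_aa = sum((ai - a_bar) ** 2 for ai in a)
    s_bb = sum((bi - b_bar) ** 2 for bi in b)
    return s_ab / math.sqrt(s_aa * s_bb)


x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]

r = pearson_r(x1, x2)
vif = 1 / (1 - r ** 2)  # two-predictor special case
print(r, vif, vif < 5)
```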
linearity
only linear relationships, makes coefficients easier to calculate and understand
normality
residuals should be normally distributed in order to calculate statistical significance
homoskedasticity
constant variance in residuals; check using rvfplot, yline(0) in Stata
no autocorrelation
cases are independent observations
statistical significance
p value of .1 or less
5 types of text analysis
concordance, collocation, significant terms, named entity recognition, sentiment
significant terms
measured by TF-IDF (term frequency - inverse document frequency)
higher TF-IDF = more significant
applications: machine learning, search engines, text summarizing
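A minimal TF-IDF sketch, assuming one common variant: term frequency as a share of the document's words, and inverse document frequency as log(N / document frequency). Libraries often use smoothed variants, so exact numbers differ. The three documents are made up.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Number of documents in which each term appears at least once
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append(
            {
                term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return scores


docs = ["the tax bill passed", "the budget bill failed", "the election results"]
scores = tf_idf(docs)
print(scores[0])  # "the" appears everywhere, so it scores zero
```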
named entity recognition analysis
used to identify people, places, important dates, organizations, objects
high degree of accuracy
used in search engines as with significant terms analysis
summarization, job hiring, project selection
what does sentiment analysis need to be effective?
the combination of the other forms of text analysis
concordance
refers to a list of all occurrences of a particular word or phrase in a text, along with the immediate context surrounding each occurrence
useful for pattern recognition, semantic analysis, feature extraction, information retrieval, language modeling
useful for public opinion
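A simple keyword-in-context sketch of a concordance, assuming whitespace tokenization and a fixed number of context words on each side. The sentence and function name are illustrative only.

```python
def concordance(text, keyword, width=3):
    """List (left context, keyword, right context) for every occurrence."""
    tokens = text.lower().split()
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


text = "the new tax plan raises the tax rate on high earners"
hits = concordance(text, "tax", width=2)
for left, word, right in hits:
    print(f"{left:>12} [{word}] {right}")
```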
collocation
concepts in a text that cannot be expressed in a single word
collocations are a statistical overview of words that have a relatively high co-occurrence with a particular keyword - words close to the keyword are called collocates
space to either side of the keyword is the window
important because it allows us to map the web of connectivity between people, places, ideas, technologies, and values
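Counting collocates within a window can be sketched like this. It uses raw co-occurrence counts only; real tools typically rank collocates with an association measure, which is not shown here. The sentence is invented.

```python
from collections import Counter


def collocates(text, keyword, window=2):
    """Count words appearing within `window` tokens of the keyword."""
    tokens = text.lower().split()
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == keyword:
            # Words to the left and right of the keyword, excluding the keyword
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts


sentence = "the new tax plan raises the tax rate on high earners"
counts = collocates(sentence, "tax", window=2)
print(counts.most_common(3))
```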
RCT
scientific method to establish causation, helps overcome endogeneity problems in research
key components: random assignment, participants randomly divided into groups, treatment group(s), control group
advantages: establishes causal relationships, reduces selection bias, controls for confounding variables, replicable results
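Random assignment, the first key component, could be sketched as a shuffle followed by a split. The participant IDs and the seed are arbitrary; fixing the seed only makes the assignment reproducible for documentation.

```python
import random


def random_assignment(participants, seed=None):
    """Shuffle participants and split them into treatment and control."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


treatment, control = random_assignment(range(1, 21), seed=42)
print("treatment:", sorted(treatment))
print("control:  ", sorted(control))
```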
common RCT pitfalls checklist
no true random assignment, selection bias, small sample size, no control group, observer bias, Hawthorne effect, insufficient timeline, contamination between groups, poor measurement criteria, lack of blinding
ecological fallacy
logical error where you wrongly assume that trends or characteristics seen in a group also apply to every individual within that group, even though group-level averages can hide individual variations
issue when the main result is not the “true” result at subset levels, but can still be useful to infer some items
look at control variables, baseline, and/or lack of independence in observations
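A toy illustration of how group-level and individual-level patterns can point in opposite directions. The data are constructed for the purpose: across the two group averages the relationship is positive, while within each group it is negative.

```python
def slope(points):
    """OLS slope of y on x for a list of (x, y) points."""
    x_bar = sum(x for x, _ in points) / len(points)
    y_bar = sum(y for _, y in points) / len(points)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    return s_xy / s_xx


def centroid(points):
    """Group average as an (x, y) point."""
    return (
        sum(x for x, _ in points) / len(points),
        sum(y for _, y in points) / len(points),
    )


group_a = [(1, 3), (2, 2), (3, 1)]
group_b = [(4, 6), (5, 5), (6, 4)]

between_groups = slope([centroid(group_a), centroid(group_b)])
within_a = slope(group_a)
within_b = slope(group_b)
print(between_groups, within_a, within_b)
```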
interaction effects
causal effects can sometimes depend on the combination of two (or more) factors
in general, social reality is full of interaction effects, making it difficult to model
causal endogeneity
an exogenous variable is one that is not determined by anything within the model (weather) while an endogenous variable is determined by other variables in our model
the most serious problem is reverse causation from our outcome (dependent) variable back to the independent variables
can be overcome by using time series, running an RCT, or using other techniques such as natural experiments
r vs r² vs adjusted r²
r (correlation coefficient) measures the magnitude and direction of a linear relationship [-1,1]
r² (coefficient of determination) measures the proportion of variance in the dependent variable explained by the model [0,1]
adjusted r² accounts for the number of predictors and sample size
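The adjustment can be sketched with the standard formula 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the sample size and k the number of predictors. The R², n, and k values are example inputs only.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


adj_3 = adjusted_r2(0.85, n=50, k=3)
adj_10 = adjusted_r2(0.85, n=50, k=10)
print(adj_3, adj_10)  # same R², but more predictors means a larger penalty
```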
units
if variable is logged, use %. if variable is in %, use % points
data visualization layering
technique used to increase the dimensions of data that you display by adding as much relevant info as possible (color, size, time)
data cleaning steps
initial data inspection
handling missing data
data type corrections
dealing with duplicates
standardizing variables
handling outliers
data validation
creating new variables
data transformation
merging and appending
final data check and documentation
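A few of the steps above (standardizing variables, data type corrections, handling missing data, dealing with duplicates) can be sketched in plain Python. The field names and records are hypothetical, and real projects would usually do this in a statistics package or data library.

```python
def clean(records):
    """Standardize names, convert ages to integers, and drop duplicates."""
    seen = set()
    cleaned = []
    for row in records:
        name = row["name"].strip().title()  # standardize text values
        age_raw = row.get("age")
        # Type correction; empty or absent values are kept as missing (None)
        age = int(age_raw) if age_raw not in (None, "") else None
        key = (name, age)
        if key in seen:  # skip rows that duplicate an earlier one
            continue
        seen.add(key)
        cleaned.append({"name": name, "age": age})
    return cleaned


raw = [
    {"name": " alice ", "age": "34"},
    {"name": "Alice", "age": "34"},
    {"name": "bob", "age": ""},
]
cleaned = clean(raw)
print(cleaned)
```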