Intro Data Science Final

50 Terms


t-statistic represents

how many standard errors the estimate is from zero
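
As a minimal sketch, the t-statistic is the coefficient estimate divided by its standard error. The function name and the numbers below are illustrative assumptions, not values from the course.

```python
def t_statistic(estimate: float, std_error: float, null_value: float = 0.0) -> float:
    """Return how many standard errors the estimate lies from the null value."""
    if std_error <= 0:
        raise ValueError("standard error must be positive")
    return (estimate - null_value) / std_error


# Hypothetical regression output: coefficient 2.5 with standard error 1.0
t = t_statistic(2.5, 1.0)
print(t)  # 2.5
```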


omitted variable bias

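arises when a relevant variable (one that affects the dependent variable and is correlated with an included independent variable) is left out of the model
a correctly specified model should reflect a plausible set of assumptions and should be complete
OVB is very hard to detect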


A model with a higher R² means

the model explains more of the variance in the dependent variable


data mining

selectively reporting significant results

scouring through datasets for (partial) correlations from which to develop a thesis


how can outliers and violations of normality affect statistical inference

outliers and non-normal data can distort estimates and invalidate hypothesis tests. Check residual plots, use transformations, or robust methods


consider a regression output where R² = 0.85 and TSS = 1000, what is the ESS?

ESS = R² × TSS = 0.85 × 1000 = 850


three main types of distributional tests

seeing where a value fits within a distribution (determines how unusual or typical a single value is compared to the rest of the data)
making a certainty estimate from a sample to the general population (estimating population parameters from a sample, larger the sample the better our estimates)
comparing two sample means (involves testing whether two groups are significantly different from each other)


z-score

number of standard deviations away from the mean
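
A minimal sketch of the calculation, z = (value - mean) / standard deviation. The data are made up for illustration, and using the population standard deviation (`statistics.pstdev`) rather than the sample one is an assumption.

```python
import statistics


def z_score(value, data):
    """Number of standard deviations that `value` lies from the mean of `data`."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)  # population SD; statistics.stdev gives the sample SD
    return (value - mean) / sd


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data: mean 5, population SD 2
z = z_score(9, scores)
print(z)  # 2.0
```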


confidence intervals

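error bars around an estimate; at the 95% confidence level, the interval comes from a procedure that captures the true value of the measure you are interested in for about 95% of repeated samples
how do we check if two proportions are statistically significantly different? Find both confidence intervals and check whether they overlap: if they do not overlap, the difference is significant (overlapping intervals alone do not prove that the difference is insignificant)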
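
The overlap check can be sketched as follows. This assumes the normal-approximation (Wald) interval for a proportion with z = 1.96; the counts are hypothetical and the function names are illustrative.

```python
import math


def proportion_ci(successes, n, z=1.96):
    """Approximate 95% confidence interval for a proportion (normal approximation)."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)  # standard error of the proportion
    return (p - z * se, p + z * se)


def intervals_overlap(a, b):
    """True if intervals a = (low, high) and b = (low, high) share any values."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical samples: 60 of 100 versus 40 of 100
ci_a = proportion_ci(60, 100)
ci_b = proportion_ci(40, 100)
print(ci_a, ci_b, intervals_overlap(ci_a, ci_b))
```

Here the two intervals are roughly (0.50, 0.70) and (0.30, 0.50) and do not overlap, so the two proportions would be treated as significantly different.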


standard error formula

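standard deviation of the sampling distribution of the sample mean: SE = s / √n (sample standard deviation divided by the square root of the sample size)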


what question do we answer with a t test

is the difference between the two sample means statistically significant?
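
As a rough sketch, a two-sample t-statistic divides the difference in means by the standard error of that difference. The version below is Welch's unequal-variance form, which is an assumption (the notes do not say which variant is used), and the data are invented.

```python
import math
import statistics


def welch_t(a, b):
    """Two-sample t-statistic without assuming equal variances."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


group_a = [5, 6, 7, 8, 9]  # hypothetical sample, mean 7
group_b = [1, 2, 3, 4, 5]  # hypothetical sample, mean 3
t_two_sample = welch_t(group_a, group_b)
print(t_two_sample)  # 4.0
```

The resulting statistic is then compared with a critical value (or converted to a p-value) to decide whether the difference is significant.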


what are all the explanations of correlation between A and B?

A causes B (possibly through C)

B causes A (possibly through C)

C causes both A and B

C causes B, and A is only spuriously correlated


What other misspecifications could there be to a set of data?

omitted variable bias
included variable bias
normality and outliers
data mining
interaction effects
causal endogeneity
ecological fallacy


included variable bias

the addition of multiple (usually irrelevant) variables to obtain the desired result


normality and outliers

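OLS inference works best when variables (and especially the residuals) follow a (close to) normal distribution; skewed data can still be transformed (for example, by taking logs)
make sure you identify outliers and deal with them (investigate, transform, or remove them when justified)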


How is the margin of error affected by sample size

The margin of error decreases as the sample size increases.
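
A small sketch of why this happens: for a sample mean, the margin of error is z × s / √n, so it shrinks with the square root of the sample size. The standard deviation of 10 and the sample sizes are hypothetical.

```python
import math


def margin_of_error(sd, n, z=1.96):
    """Margin of error for a sample mean at roughly the 95% confidence level."""
    return z * sd / math.sqrt(n)


moe_100 = margin_of_error(10.0, 100)  # about 1.96
moe_400 = margin_of_error(10.0, 400)  # about 0.98
print(moe_100, moe_400)
```

Quadrupling the sample size only halves the margin of error.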


inferential statistics

using sample data to make inferences about a larger population


distributional tests

statistical tools that help determine how data points relate to theoretical probability distributions


OLS (ordinary least squares) multivariate analysis

statistical technique used to estimate the relationship between one dependent variable and two or more independent variables, extension of a simple linear regression


R² formula

1 - (residual sum of squares / total sum of squares)


TSS (total sum of squares)

measures the total variation in the dependent variable
quantifies how spread out the observed data are before considering any model


ESS (explained sum of squares)

measures the variation in the dependent variable that is explained by the model (ESS = TSS - RSS)
describes how well a model, often a regression model, represents the data being modelled


RSS (residual sum of squares)

measures the unexplained variation
sum of the squared differences between each observed point and the line of best fit
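
The three sums of squares and R² can be sketched with a hand-rolled simple regression. The data are invented and `fit_line` is an illustrative helper, not a library function.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


x = [1, 2, 3, 4, 5]  # hypothetical data
y = [2, 4, 5, 4, 5]
intercept, slope = fit_line(x, y)
fitted = [intercept + slope * xi for xi in x]
mean_y = sum(y) / len(y)

tss = sum((yi - mean_y) ** 2 for yi in y)              # total variation
ess = sum((fi - mean_y) ** 2 for fi in fitted)         # explained variation
rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained variation
r2 = 1 - rss / tss
print(tss, ess, rss, r2)
```

For these numbers TSS is 6, ESS is about 3.6 and RSS is about 2.4, so R² is about 0.6 and TSS = ESS + RSS holds.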


level-level

x increases by one unit, y increases by the coefficient of x (units)


log-level

x increases by one unit, y increases by the coefficient × 100 (%)


level-log

x increases by 1%, y increases by the coefficient / 100 (units)


log-log

x increases by 1%, y increases by coefficient (%)
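
The four interpretation rules can be collected in one small lookup. This is a sketch: the naming follows these notes (the first word describes y, the second describes x), the function name is hypothetical, and the percentage readings are the usual approximations for small changes.

```python
def interpret(form: str, coef: float) -> str:
    """Describe the approximate effect on y of the standard change in x."""
    rules = {
        "level-level": ("1 unit", coef, "units"),
        "log-level": ("1 unit", coef * 100, "%"),
        "level-log": ("1%", coef / 100, "units"),
        "log-log": ("1%", coef, "%"),
    }
    x_change, y_change, y_unit = rules[form]
    return f"x up by {x_change} -> y changes by about {y_change:g} {y_unit}"


print(interpret("log-level", 0.05))  # x up by 1 unit -> y changes by about 5 %
print(interpret("level-log", 200))   # x up by 1% -> y changes by about 2 units
```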


Regression process

descriptive statistics, check regression assumptions, run the regression analysis (y variable first), check statistical significance, interpret, repeat, choose the correct model


5 OLS Assumptions

no multicollinearity, no autocorrelation, linear, homoskedasticity, normal distribution


No multicollinearity

no two independent variables are highly correlated; check with the VIF test (VIF < 5)
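
A minimal sketch of the VIF check for the special case of exactly two predictors, where VIF = 1 / (1 - r²) and r is the correlation between them. With more predictors, r² is replaced by the R² from regressing one predictor on all the others. The data and function names are illustrative.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def vif_two_predictors(a, b):
    """Variance inflation factor when the model has only these two predictors."""
    return 1 / (1 - pearson_r(a, b) ** 2)


x1 = [1, 2, 3, 4, 5]  # hypothetical predictors with correlation 0.8
x2 = [2, 1, 4, 3, 5]
vif = vif_two_predictors(x1, x2)
print(vif, vif < 5)
```

Here the VIF is roughly 2.8, below the threshold of 5 used in these notes.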


linearity

only linear relationships, makes coefficients easier to calculate and understand


normality

normal distribution in order to calculate statistical significance


homoskedasticity

constant variance in residuals; check with a residual-versus-fitted plot (rvfplot, yline(0) in Stata)


no autocorrelation

cases are independent observations


statistical significance

p-value of 0.1 or less (0.05 is the more common threshold)


5 types of text analysis

concordance, collocation, significant terms, named entity recognition, sentiment


significant terms

measured with TF-IDF (term frequency-inverse document frequency)
higher TF-IDF = more significant
applications: machine learning, search engines, text summarizing
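
A minimal sketch of one common TF-IDF variant: term frequency is the term's share of the document's words, and inverse document frequency is log(number of documents / number of documents containing the term). Other weightings exist, and the tiny corpus is invented.

```python
import math


def tf_idf(term, doc, docs):
    """TF-IDF of `term` in `doc`, where `docs` is the whole corpus of token lists."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)  # documents containing the term
    if df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)


docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
for word in ("the", "cat", "mat"):
    print(word, tf_idf(word, docs[0], docs))
```

"the" appears in every document and scores 0, while "mat" appears in only one document and scores highest of the three.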


named entity recognition analysis

used to identify people, places, important dates, organizations, objects
high degree of accuracy
used in search engines as with significant terms analysis
summarization, job hiring, project selection


what does sentiment analysis need to be effective?

the combination of the other forms of text analysis


concordance

refers to a list of all occurrences of a particular word or phrase in a text, along with the immediate context surrounding each occurrence
useful for pattern recognition, semantic analysis, feature extraction, information retrieval, language modeling
useful for public opinion
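
A concordance can be sketched as a keyword-in-context listing. The context width of two words and the sample sentence are assumptions for illustration.

```python
def concordance(tokens, keyword, width=2):
    """List each occurrence of `keyword` with `width` words of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


text = "the economy grew while the local economy slowed".split()
hits = concordance(text, "economy")
for left, word, right in hits:
    print(f"{left:>12} [{word}] {right}")
```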


collocation

concepts in a text that cannot be expressed in a single word
collocations are a statistical overview of words that have a relatively high co-occurrence with a particular keyword - words close to the keyword are called collocates
space to either side of the keyword is the window
important because it allows us to map the web of connectivity between people, places, ideas, technologies, and values
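
Counting collocates inside a window can be sketched as below. These are raw co-occurrence counts only; a fuller analysis would compare them with each word's overall frequency. The window of two words and the sample sentence are assumptions.

```python
from collections import Counter


def collocates(tokens, keyword, window=2):
    """Count words that fall within `window` positions of each keyword occurrence."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts


words = "the economy grew while the local economy slowed".split()
counts = collocates(words, "economy")
print(counts.most_common(3))
```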


RCT

scientific method to establish causation, helps overcome endogeneity problems in research
key components: random assignment, participants randomly divided into groups, treatment group(s), control group
advantages: establishes causal relationships, reduces selection bias, controls for confounding variables, replicable results
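
Random assignment, the first key component, can be sketched as a shuffle followed by an even split. The participant IDs, the fixed seed (used only to make the example repeatable), and the function name are assumptions.

```python
import random


def random_assignment(participants, seed=None):
    """Shuffle participants and split them into (treatment, control) groups."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


treatment, control = random_assignment(range(20), seed=1)
print(len(treatment), len(control))  # 10 10
```

Because chance alone decides who lands in each group, confounding variables should be balanced across the groups on average.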


common RCT pitfalls checklist

no true random assignment, selection bias, small sample size, no control group, observer bias, Hawthorne effect, insufficient timeline, contamination between groups, poor measurement criteria, lack of blinding


ecological fallacy

logical error where you wrongly assume that trends or characteristics seen in a group also apply to every individual within that group, even though group-level averages can hide individual variations
an issue when the main (aggregate) result is not the “true” result at subset levels, though the aggregate result can still be useful for some inferences
look at control variables, the baseline, and/or a lack of independence between observations


interaction effects

causal effects can sometimes depend on the combination of two (or more) factors
in general, social reality is full of interaction effects, making it difficult to model


causal endogeneity

an exogenous variable is one that is not determined by anything within the model (weather) while an endogenous variable is determined by other variables in our model
the most serious problem is reverse causation, where the outcome (dependent) variable affects the independent variables
can be overcome by using time series, running an RCT, or using other techniques such as natural experiments


r vs r² vs adjusted r²

r (correlation coefficient) measures the magnitude and direction of a linear relationship [-1,1]
r² (coefficient of determination) measures the proportion of variance in the dependent variable explained by the model [0,1]
adjusted r² accounts for the number of predictors and sample size
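
The adjustment can be sketched with the standard formula, adjusted R² = 1 - (1 - R²)(n - 1) / (n - k - 1), where n is the sample size and k the number of predictors. The values n = 50 and k = 3 are hypothetical.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R² for a model with k predictors fitted on n observations."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


adj = adjusted_r2(0.85, n=50, k=3)
print(round(adj, 4))  # 0.8402
```

Adding predictors raises k, so adjusted R² only increases when a new variable improves the fit by more than the penalty.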


units

if variable is logged, use %. if variable is in %, use % points


data visualization layering

technique used to increase the number of dimensions of data you display by adding as much relevant information as possible (color, size, time)


data cleaning steps

initial data inspection
handling missing data
data type corrections
dealing with duplicates
standardizing variables
handling outliers
data validation
creating new variables
data transformation
merging and appending
final data check and documentation
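
A few of these steps (standardizing variables, handling missing data, data type corrections, dealing with duplicates) can be sketched on a tiny invented dataset. The column names are hypothetical, and dropping incomplete rows is only one of several ways to handle missing data.

```python
def clean(rows):
    """Return cleaned copies of rows, each a dict with 'name' and 'age' strings."""
    seen = set()
    cleaned = []
    for row in rows:
        name = row.get("name", "").strip().title()  # standardize text values
        raw_age = row.get("age", "").strip()
        if not name or not raw_age:                 # missing data: drop the row
            continue
        age = int(raw_age)                          # data type correction
        key = (name, age)
        if key in seen:                             # skip duplicates
            continue
        seen.add(key)
        cleaned.append({"name": name, "age": age})
    return cleaned


raw = [
    {"name": " alice ", "age": "30"},
    {"name": "Alice", "age": "30"},
    {"name": "bob", "age": ""},
    {"name": "Carol", "age": "41"},
]
tidy = clean(raw)
print(tidy)
```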