Intro Data Science Final

50 Terms

1

t-statistic represents

how many standard errors the estimate is from zero
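
A minimal sketch of the calculation behind this card, assuming the null hypothesis is that the coefficient equals zero. The function name `t_statistic` and the example numbers are illustrative, not taken from the course.

```python
def t_statistic(estimate: float, standard_error: float, null_value: float = 0.0) -> float:
    """Distance of the estimate from the null value, measured in standard errors."""
    if standard_error <= 0:
        raise ValueError("standard_error must be positive")
    return (estimate - null_value) / standard_error


# Hypothetical coefficient of 1.5 with a standard error of 0.5.
t = t_statistic(estimate=1.5, standard_error=0.5)
print(t)  # 3.0, i.e. the estimate sits three standard errors away from zero
```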

2

omitted variable bias

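  • arises when a relevant variable is left out of the model and is correlated with an included independent variable, so part of its effect is wrongly attributed to the included one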

  • a correctly specified model should reflect a plausible set of assumptions and should be complete

  • very hard to detect OVB

3

A model with a higher R² means

the model explains more of the variance in the dependent variable

4

data mining

  • selectively reporting significant results

  • scouring through datasets for (partial) correlations from which to develop a thesis

5

how can outliers and violations of normality affect statistical inference

Outliers and non-normal data can distort estimates and invalidate hypothesis tests. Check residual plots, and use transformations or robust methods.

6

consider a regression output where R² = 0.85 and TSS = 1000, what is the ESS?

ESS = R² × TSS = 0.85 × 1000 = 850
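
The same arithmetic as a small sketch, which also recovers the unexplained part using the identity TSS = ESS + RSS. The helper name `split_variation` is made up for illustration.

```python
def split_variation(r_squared: float, tss: float) -> tuple[float, float]:
    """Return (ESS, RSS) implied by R² and TSS, using TSS = ESS + RSS."""
    ess = r_squared * tss  # explained part of the variation
    rss = tss - ess        # what the model leaves unexplained
    return ess, rss


ess, rss = split_variation(0.85, 1000)
print(round(ess, 6), round(rss, 6))
```

With the numbers from the card, the explained part is 850 and the residual part is 150.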

7

three main types of distributional tests

  • seeing where a value fits within a distribution (determines how unusual or typical a single value is compared to the rest of the data)

  • making a certainty estimate from a sample to the general population (estimating population parameters from a sample, larger the sample the better our estimates)

  • comparing two sample means (involves testing whether two groups are significantly different from each other)

8

z-score

number of standard deviations away from the mean
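
A minimal sketch of the z-score calculation. It assumes the data are treated as a whole population (so it uses the population standard deviation); the data values are invented for illustration.

```python
import statistics


def z_score(value: float, data: list[float]) -> float:
    """Number of standard deviations that `value` lies from the mean of `data`."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)  # population standard deviation (an assumption)
    return (value - mean) / sd


data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population standard deviation 2
print(z_score(9, data))          # 2.0: the value 9 is two standard deviations above the mean
```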

9

confidence intervals

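  • error bars; for example, a 95% confidence interval is a range built so that, across repeated samples, about 95% of such intervals contain the measure you are interested in

  • how do we check if two proportions are statistically significantly different? Find both confidence intervals and check whether they overlap: if they do not overlap, the difference is significant (overlapping intervals are inconclusive, because this check is conservative)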
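
The overlap check can be sketched as follows, assuming the normal approximation for a proportion with a 95% critical value of 1.96. The function names and the counts are hypothetical.

```python
import math

Z_95 = 1.96  # critical value for a 95% interval under the normal approximation


def proportion_ci(successes: int, n: int, z: float = Z_95) -> tuple[float, float]:
    """Approximate confidence interval for a sample proportion."""
    p = successes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


ci_a = proportion_ci(300, 1000)  # hypothetical group A: 30% of 1000
ci_b = proportion_ci(500, 1000)  # hypothetical group B: 50% of 1000
print(ci_a, ci_b, intervals_overlap(ci_a, ci_b))
```

Here the two intervals do not overlap, so the proportions would be treated as significantly different.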

10

standard error formula

SE = s / √n (sample standard deviation divided by the square root of the sample size); it is the standard deviation of the sample mean
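
A short sketch of the standard error of the mean, computed as the sample standard deviation divided by the square root of the sample size. The sample values are invented.

```python
import math
import statistics


def standard_error(sample: list[float]) -> float:
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(standard_error(sample))
```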

11

what question do we answer with a t test

is the difference between the two means statistically significant?

12

what are all the explanations of correlation between A and B?

A causes B (possibly through C)

B causes A (possibly through C)

C causes both A and B

C causes B, and A is only spuriously correlated

13

What other misspecifications could there be to a set of data?

  • omitted variable bias

  • included variable bias

  • normality and outliers

  • data mining

  • interaction effects

  • causal endogeneity

  • ecological fallacy

14

included variable bias

the addition of multiple (usually irrelevant) variables to obtain the desired result

15

normality and outliers 

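  • OLS inference relies on residuals that follow a (close to) normal distribution; heavily skewed variables can be transformed (for example, logged)

  • check for outliers and deal with them (investigate, transform, or remove with a stated justification)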

16

How is the margin of error affected by sample size

The margin of error decreases as the sample size increases.
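
A sketch of why this happens, assuming the usual margin-of-error formula for a mean, z × σ / √n, with z = 1.96 for 95% confidence. The standard deviation of 10 is an invented value.

```python
import math


def margin_of_error(std_dev: float, n: int, z: float = 1.96) -> float:
    """Margin of error for a mean under the normal approximation."""
    return z * std_dev / math.sqrt(n)


# Quadrupling the sample size halves the margin of error.
for n in (100, 400, 1600):
    print(n, margin_of_error(10, n))
```

Because n sits under a square root, each halving of the margin requires four times as many observations.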

17

inferential statistics

using sample data to make inferences about a larger population

18

distributional tests

statistical tools that help determine how data points relate to theoretical probability distributions

19

OLS (ordinary least squares) multivariate analysis

statistical technique used to estimate the relationship between one dependent variable and two or more independent variables, extension of a simple linear regression

20

R² formula

1 - (residual sum of squares / total sum of squares)

21

TSS (total sum of squares)

  • measures the total variation in the dependent variable

  • quantifies how spread out the observed data are before considering any model

22

ESS (explained sum of squares)

  • measures the variation in the dependent variable that the model explains; used to describe how well a model, often a regression model, represents the data being modelled

23

RSS (residual sum of squares)

  • measures the unexplained variation

  • sum of the squared differences between each observed point and the line of best fit
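
To tie TSS, ESS, RSS, and R² together, here is a sketch that fits a one-variable least-squares line by hand and computes all three sums. The data are invented, and the single-predictor setup is a simplifying assumption.

```python
def sums_of_squares(x: list[float], y: list[float]) -> tuple[float, float, float]:
    """Fit y = a + b*x by least squares and return (TSS, ESS, RSS)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    intercept = mean_y - slope * mean_x
    fitted = [intercept + slope * xi for xi in x]

    tss = sum((yi - mean_y) ** 2 for yi in y)                # total variation in y
    ess = sum((fi - mean_y) ** 2 for fi in fitted)           # variation explained by the line
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # unexplained variation
    return tss, ess, rss


tss, ess, rss = sums_of_squares([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
r_squared = 1 - rss / tss
print(tss, ess, rss, r_squared)
```

For these points TSS is 6, ESS is about 3.6, and RSS is about 2.4, so R² is about 0.6 whether computed as 1 - RSS/TSS or as ESS/TSS.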

24

level-level

x increases by one unit, y increases by the coefficient of x (units)

25

log-level

x increases by one unit, y increases by the coefficient × 100 (%)

26

level-log

x increases by 1%, y increases by coefficient / 100 (units)

27

log-log

x increases by 1%, y increases by coefficient (%)
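
The four functional forms can be summarised in one lookup. This is a sketch using the standard approximations, in which a level-log coefficient is divided by 100 and read in units of y; the function name `interpret` and the example coefficients are hypothetical.

```python
def interpret(form: str, coefficient: float) -> tuple[float, str]:
    """Return (approximate change in y, unit) for the stated change in x."""
    rules = {
        "level-level": (coefficient, "units of y per 1-unit increase in x"),
        "log-level": (coefficient * 100, "% change in y per 1-unit increase in x"),
        "level-log": (coefficient / 100, "units of y per 1% increase in x"),
        "log-log": (coefficient, "% change in y per 1% increase in x"),
    }
    return rules[form]


for form, coef in [("level-level", 2.0), ("log-level", 0.05),
                   ("level-log", 300.0), ("log-log", 0.8)]:
    change, unit = interpret(form, coef)
    print(f"{form}: {change:g} {unit}")
```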

28

Regression process

descriptive statistics, check regression assumptions, run the regression (y variable listed first), check statistical significance, interpret, repeat, choose the correct model

29

5 OLS Assumptions

no multicollinearity, no autocorrelation, linear, homoskedasticity, normal distribution

30

No multicollinearity

no two independent variables are highly correlated; check that each variance inflation factor (VIF) is below 5
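
A sketch of the VIF check for the special case of exactly two predictors, where VIF = 1 / (1 - r²) and r is the correlation between them. With more predictors, r² would come from regressing each predictor on all the others, which this sketch does not do. The data are invented.

```python
import math


def pearson_r(a: list[float], b: list[float]) -> float:
    """Pearson correlation coefficient between two equally long lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


def vif_two_predictors(x1: list[float], x2: list[float]) -> float:
    """Variance inflation factor when the model has only these two predictors."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)


x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]  # correlation with x1 is 0.8
vif = vif_two_predictors(x1, x2)
print(vif, "ok" if vif < 5 else "possible multicollinearity")
```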

31

linearity

only linear relationships, makes coefficients easier to calculate and understand

32

normality

residuals follow a normal distribution, which is needed in order to calculate statistical significance

33

homoskedasticity

constant variance in residuals; check with rvfplot, yline(0) in Stata

34

no autocorrelation

cases are independent observations

35

statistical significance 

p value of .1 or less

36

5 types of text analysis

concordance, collocation, significant terms, named entity recognition, sentiment

37

significant terms

  • scored with TF-IDF (term frequency-inverse document frequency)

  • higher TF-IDF = more significant

  • applications: machine learning, search engines, text summarizing
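
A minimal sketch of one common TF-IDF variant: term frequency as a share of the document, and inverse document frequency as the natural log of (number of documents / documents containing the term). Libraries use slightly different smoothing, so treat the exact formula as an assumption. The tiny corpus is invented.

```python
import math
from collections import Counter


def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    """TF-IDF score of `term` in `doc`, relative to `corpus`."""
    tf = Counter(doc)[term] / len(doc)
    df = sum(1 for d in corpus if term in d)  # documents containing the term
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
for term in ("the", "sat", "dog"):
    print(term, tf_idf(term, corpus[1], corpus))
```

In the second document, "the" scores zero because it appears in every document, while "dog" scores highest because it appears nowhere else.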

38

named entity recognition analysis

  • used to identify people, places, important dates, organizations, objects

  • high degree of accuracy

  • used in search engines as with significant terms analysis

  • summarization, job hiring, project selection

39

what does sentiment analysis need to be effective?

the combination of the other forms of text analysis

40

concordance

  • refers to a list of all occurrences of a particular word or phrase in a text, along with the immediate context surrounding each occurrence

  • useful for pattern recognition, semantic analysis, feature extraction, information retrieval, language modeling

  • useful for public opinion
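
A concordance can be sketched as a keyword-in-context listing. The whitespace tokenisation, the bracket formatting, and the example sentence are all illustrative assumptions.

```python
def concordance(tokens: list[str], keyword: str, width: int = 2) -> list[str]:
    """List every occurrence of `keyword` with `width` tokens of context per side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


tokens = "the cat sat on the mat and the dog sat too".split()
for line in concordance(tokens, "sat"):
    print(line)
# the cat [sat] on the
# the dog [sat] too
```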

41

collocation

  • concepts in a text that cannot be expressed in a single word 

  • collocations are a statistical overview of words that have a relatively high co-occurrence with a particular keyword - words close to the keyword are called collocates 

  • space to either side of the keyword is the window

  • important because it allows us to map the web of connectivity between people, places, ideas, technologies, and values
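
The keyword, window, and collocates described above can be sketched with a simple counter. This version uses raw co-occurrence counts; real collocation analysis usually adds a statistical association measure, which is omitted here. The example sentence is invented.

```python
from collections import Counter


def collocates(tokens: list[str], keyword: str, window: int = 2) -> Counter:
    """Count the words that fall within `window` tokens on either side of `keyword`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == keyword:
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts


tokens = "the cat sat on the mat and the dog sat too".split()
print(collocates(tokens, "sat").most_common())
```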

42

RCT

  • scientific method to establish causation, helps overcome endogeneity problems in research

  • key components: random assignment, participants randomly divided into groups, treatment group(s), control group

  • advantages: establishes causal relationships, reduces selection bias, controls for confounding variables, replicable results
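
Random assignment, the first key component, can be sketched as a shuffle followed by an even split. The function name, the participant IDs, and the fixed seed (used only to make the example repeatable) are assumptions for illustration.

```python
import random


def random_assignment(participants, seed=None):
    """Shuffle participants and split them into (treatment, control) groups."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# 20 hypothetical participant IDs.
treatment, control = random_assignment(range(1, 21), seed=42)
print("treatment:", sorted(treatment))
print("control:  ", sorted(control))
```

Because chance alone decides who lands in which group, participant characteristics should balance out across groups on average.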

43

common RCT pitfalls checklist

no true random assignment, selection bias, small sample size, no control group, observer bias, Hawthorne effect, insufficient timeline, contamination between groups, poor measurement criteria, lack of blinding

44

ecological fallacy

  • logical error where you wrongly assume that trends or characteristics seen in a group also apply to every individual within that group, even though group-level averages can hide individual variations

  • issue when the main result is not the “true” result at subset levels, but can still be useful to infer some items

  • look at control variables, baseline, and/or lack of independence in observations

45

interaction effects

  • causal effects can sometimes depend on the combination of two (or more) factors

  • in general, social reality is full of interaction effects, making it difficult to model

46

causal endogeneity

  • an exogenous variable is one that is not determined by anything within the model (weather) while an endogenous variable is determined by other variables in our model

  • the most serious problem is where there is reverse causation back from our outcome (dependent) variable back to the independent variables

  • can be overcome by using time series, running an RCT, or using other techniques such as natural experiments

47

r vs r² vs adjusted r²

  • r (correlation coefficient) measures the magnitude and direction of a linear relationship [-1,1]

  • r² (coefficient of determination) measures the proportion of variance in the dependent variable explained by the model [0,1]

  • adjusted r² accounts for the number of predictors and sample size
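
A sketch of the usual adjusted R² formula, 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the sample size and k the number of predictors. The example values are invented.

```python
def adjusted_r_squared(r_squared: float, n: int, k: int) -> float:
    """Adjusted R² for a model with k predictors fitted on n observations."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


# Same R², more predictors: the adjusted value falls.
print(adjusted_r_squared(0.85, n=100, k=5))
print(adjusted_r_squared(0.85, n=100, k=20))
```

The penalty grows with the number of predictors, so adding variables that explain little lowers adjusted R² even though plain R² never decreases.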

48

units

if the variable is logged, interpret changes in %. If the variable is already measured in %, interpret changes in percentage points

49

data visualization layering

technique used to increase the number of dimensions of data that a chart displays by encoding additional relevant information (color, size, time)

50

data cleaning steps

  1. initial data inspection

  2. handling missing data

  3. data type corrections

  4. dealing with duplicates

  5. standardizing variables

  6. handling outliers

  7. data validation

  8. creating new variables

  9. data transformation

  10. merging and appending

  11. final data check and documentation
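
A few of these steps (handling missing data, data type corrections, dealing with duplicates, standardizing variables) can be sketched on a toy dataset. The field names, the rule of treating a repeated `id` as a duplicate, and the choice to store missing income as `None` are all assumptions made for illustration.

```python
raw_rows = [
    {"id": "1", "country": " France ", "income": "42000"},
    {"id": "2", "country": "france", "income": ""},
    {"id": "2", "country": "france", "income": ""},  # duplicate record
    {"id": "3", "country": "SPAIN", "income": "39000"},
]


def clean(rows):
    """Drop duplicate ids, fix types, standardize text, and mark missing values."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        if row["id"] in seen_ids:  # dealing with duplicates
            continue
        seen_ids.add(row["id"])
        income = row["income"].strip()
        cleaned.append({
            "id": int(row["id"]),                          # data type correction
            "country": row["country"].strip().title(),     # standardizing variables
            "income": float(income) if income else None,   # handling missing data
        })
    return cleaned


cleaned = clean(raw_rows)
for row in cleaned:
    print(row)
```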