Data Science Final

62 Terms

1

Population

The entire group you want to learn about

2

Sample

The subset you actually observe

3

Parameter

A number describing the population (usually unknown)

4

Statistic

A number calculated from the sample (used to estimate parameters)

5

Population distribution

Distribution of individual values in the population

6

Sample distribution

Distribution of individual values in your sample

7

Sampling distribution

Distribution of a statistic (such as the sample mean) across many possible samples

8

Central Limit Theorem (CLT)

For sufficiently large samples, the sampling distribution of the sample mean is approximately normal, regardless of the shape of the population distribution
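A quick simulation can illustrate this card. The sketch below is only an illustration: the exponential population, the sample size of 50, and names such as `draw_sample_mean` and `sample_means` are assumptions chosen for the example, not part of the original set.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def draw_sample_mean(n):
    """Mean of n draws from a right-skewed exponential population (true mean 1.0)."""
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

# Sampling distribution: the statistic (sample mean) across many possible samples.
sample_means = [draw_sample_mean(50) for _ in range(2000)]

center = statistics.fmean(sample_means)  # should sit near the population mean, 1.0
spread = statistics.stdev(sample_means)  # should sit near 1 / sqrt(50), about 0.14

print(f"mean of sample means: {center:.3f}")
print(f"sd of sample means:   {spread:.3f}")
```

Even though individual exponential draws are strongly skewed, a histogram of `sample_means` would look roughly bell-shaped and centered on the population mean.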

9

Standard error

Standard deviation of the sampling distribution of a statistic; for the sample mean it is estimated as s / sqrt(n)
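For the sample mean, the standard error is usually estimated as the sample standard deviation divided by the square root of the sample size. A minimal sketch, assuming a made-up list of scores:

```python
import math
import statistics

scores = [72, 85, 90, 66, 78, 95, 88, 70]  # made-up sample, for illustration only

n = len(scores)
s = statistics.stdev(scores)        # sample standard deviation (n - 1 in the denominator)
standard_error = s / math.sqrt(n)   # estimated standard error of the sample mean

print(f"s = {s:.2f}, n = {n}, SE = {standard_error:.2f}")
```

Larger samples shrink the standard error because n sits in the denominator.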

10

Normal distribution

Bell-shaped and symmetric, mean = median = mode, completely determined by two parameters: mu (center) and sigma (spread), notation: X ~ N (mu, sigma²)

11

Point estimate

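A single value calculated from the sample that serves as the best guess for a population parameter. In a confidence interval for the mean, it is the sample mean at the center of the interval.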

12

Confidence interval

A range of values calculated from sample data that is likely to contain the true, unknown population parameter at a specified level of confidence

13

Confidence level

The percentage of intervals that would contain the true population parameter if you repeated the sampling many times and built an interval the same way each time (for example, 95%).

14

Margin of error

The amount added to and subtracted from the point estimate to form a confidence interval. It equals the critical value times the standard error.

15

Critical value

The cutoff point from a probability distribution (like the z- or t-distribution) that determines how far the sample statistic can deviate from the population parameter while still being consistent with a specified confidence level. It is the multiplier applied to the standard error.
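The point estimate, standard error, critical value, and margin of error fit together as: interval = point estimate ± critical value × standard error. The sketch below assumes a made-up sample and uses a z critical value for simplicity; with a sample this small a t critical value would normally be used instead.

```python
import math
from statistics import NormalDist, fmean, stdev

scores = [72, 85, 90, 66, 78, 95, 88, 70]  # made-up sample, for illustration only
confidence_level = 0.95

point_estimate = fmean(scores)                            # sample mean
standard_error = stdev(scores) / math.sqrt(len(scores))   # estimated SE of the mean

# Critical value: the z cutoff that leaves (1 - level) / 2 in each tail.
critical_value = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)

margin_of_error = critical_value * standard_error
ci = (point_estimate - margin_of_error, point_estimate + margin_of_error)

print(f"{point_estimate:.1f} ± {margin_of_error:.2f} -> ({ci[0]:.2f}, {ci[1]:.2f})")
```

Raising the confidence level raises the critical value, which widens the interval.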

16

Null hypothesis

The skeptic’s position; what we’re trying to disprove

17

Alternative hypothesis

The research claim; what we want to show

18

Test statistic

Measures how many standard errors our estimate is from the null value

19

P-value

The probability of observing data at least this extreme if Ho is true
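The test statistic and the p-value can be worked through on a small example. A minimal sketch, assuming made-up summary numbers and a z-test (normal distribution) rather than a t-test:

```python
import math
from statistics import NormalDist

# Made-up summary numbers, for illustration only.
sample_mean = 103.0
null_value = 100.0   # the mean claimed by Ho
sample_sd = 15.0
n = 100

standard_error = sample_sd / math.sqrt(n)                     # 15 / 10 = 1.5
test_statistic = (sample_mean - null_value) / standard_error  # standard errors away from the null

# Two-sided p-value: probability of a statistic at least this far from 0, in either direction.
p_value = 2 * (1 - NormalDist().cdf(abs(test_statistic)))

print(f"z = {test_statistic:.2f}, p = {p_value:.4f}")
```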

20

Significance level

The threshold (alpha) that the p-value is compared against to decide whether a result is statistically significant in a hypothesis test

21

Type 1 error

False positive (rejecting a true Ho)

22

Type 2 error

False negative (failing to reject a false Ho)
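The link between the significance level and Type 1 errors can be checked by simulation: when Ho is true, a test at alpha = 0.05 should reject about 5% of the time. The sketch below is an illustration under assumed conditions (a normal population with known standard deviation 1 and a two-sided z-test); the function name `rejects_true_null` is hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean

random.seed(1)  # fixed seed so the run is repeatable
alpha = 0.05
cutoff = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided z cutoff, about 1.96

def rejects_true_null(n=30):
    """Draw a sample from a population where Ho (mean = 0) is true and test it."""
    sample = [random.gauss(0, 1) for _ in range(n)]
    z = fmean(sample) / (1 / math.sqrt(n))  # known sigma = 1
    return abs(z) > cutoff                  # True means a false positive

trials = 4000
type_1_rate = sum(rejects_true_null() for _ in range(trials)) / trials

print(f"share of true nulls rejected: {type_1_rate:.3f}")
```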

23

One-sided test

Checks for a difference in a specific direction. The unknown true population mean is either specifically higher or lower than the null hypothesis mean.

24

Two-sided test

Checks for any difference. The unknown true population mean is not the same as the null hypothesis mean, but the direction is not specified.
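The same test statistic can lead to different conclusions under a one-sided and a two-sided test. A minimal sketch, assuming a hypothetical z statistic of 1.8, a "greater than" alternative for the one-sided case, and alpha = 0.05:

```python
from statistics import NormalDist

test_statistic = 1.8  # hypothetical z statistic
alpha = 0.05

z = NormalDist()
p_one_sided = 1 - z.cdf(test_statistic)             # only the upper tail counts
p_two_sided = 2 * (1 - z.cdf(abs(test_statistic)))  # both tails count

reject_one_sided = p_one_sided < alpha
reject_two_sided = p_two_sided < alpha

print(f"one-sided p = {p_one_sided:.4f}, reject Ho: {reject_one_sided}")
print(f"two-sided p = {p_two_sided:.4f}, reject Ho: {reject_two_sided}")
```

The two-sided p-value is double the one-sided one here, so the two-sided test needs stronger evidence to reject.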

25

Statistical significance

The result's p-value is less than the chosen significance level alpha (commonly 0.05).

26

Practical significance

Whether the effect is large enough to matter outside of the study. A result can be statistically significant and still too small to matter in practice.

27

Correlation

Measures the strength and direction of the linear relationship between two continuous variables. Ranges between -1 and 1.
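A small worked example shows the sign and the "linear only" caveat. The sketch below computes Pearson's r by hand on made-up data; the helper name `pearson_r` is an assumption for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]   # perfect positive line
falling = [10, 8, 6, 4, 2]  # perfect negative line
curved = [4, 1, 0, 1, 4]    # U-shaped: strongly related, but not linearly

r_rising = pearson_r(hours, rising)
r_falling = pearson_r(hours, falling)
r_curved = pearson_r(hours, curved)

print(r_rising, r_falling, r_curved)
```

The U-shaped data give r = 0 even though Y depends entirely on X, because r only captures linear association.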

28

Causation

One variable directly causes change in another.

29

Counterfactual

What would have happened under the alternative condition; it is unobservable

30

Average Treatment Effect (ATE)

The average causal effect of a treatment. It can be estimated as the average outcome for those who received the treatment minus the average outcome for those who did not, assuming treatment assignment is independent of potential outcomes.
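Under random assignment, the difference in group means serves as the estimate. A minimal sketch with made-up outcomes for a treated and a control group:

```python
from statistics import fmean

# Made-up outcomes, for illustration only; assumes treatment was randomly assigned.
treated_outcomes = [14, 16, 15, 18, 17]
control_outcomes = [12, 13, 11, 14, 15]

# Difference in means: average treated outcome minus average control outcome.
ate_estimate = fmean(treated_outcomes) - fmean(control_outcomes)

print(f"estimated ATE: {ate_estimate:.1f}")
```

Without random assignment, this difference mixes the treatment effect with any pre-existing differences between the groups.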

31

Confounding variable

A confounding variable is a third variable that is related to both the independent variable (IV) and the dependent variable (DV) and can distort the observed relationship between the IV and DV.

32

Direct causal relationship

X → Y

33

Spurious relationship

Z → X and Z → Y
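A simulation can show how a common cause produces a correlation between X and Y even though neither affects the other. This is a sketch under assumed conditions: the data-generating process (normal noise, unit effects of Z) and the helper `pearson_r` are made up for the example.

```python
import math
import random

random.seed(42)  # fixed seed so the run is repeatable

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations around the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / math.sqrt(vx * vy)

n = 5000
z = [random.gauss(0, 1) for _ in range(n)]  # common cause Z
x = [zi + random.gauss(0, 1) for zi in z]   # Z -> X
y = [zi + random.gauss(0, 1) for zi in z]   # Z -> Y (X never enters)

raw_r = pearson_r(x, y)  # clearly positive, although X has no effect on Y

# Remove Z's contribution (known here only because we built the data ourselves).
x_given_z = [xi - zi for xi, zi in zip(x, z)]
y_given_z = [yi - zi for yi, zi in zip(y, z)]
adjusted_r = pearson_r(x_given_z, y_given_z)  # close to zero

print(f"raw r = {raw_r:.2f}, r after removing Z = {adjusted_r:.2f}")
```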

34

Chain/mediation relationship

X → Z → Y

35

Internal validity

Checks if the study established causation. Did the treatment cause the effect?

36

External validity

Checks if the results of the study can be generalized to the real world

37

Selection bias

Bias that arises when who ends up in the sample or in the treatment group is not random. Forms: sample selection (who volunteers), treatment selection (who seeks treatment), attrition (who drops out)

38

Measurement validity

Whether your measures actually capture the real-world concept. Concerns: lab vs. reality, self-report vs. behavior, proxy measures

39

Generalizability

Whether the finding will apply to different populations, different settings, and different times

40

Demand effects/reactivity

Whether people behave differently because they’re being studied. Hawthorne effects, social desirability bias, experimenter demand

41

Reverse causality

Does X cause Y or does Y cause X (or both?)

42

Correlation coefficient (r)

Gives us a single number that summarizes the linear relationship between two variables

43

Intercept (a)

Tells us the value the model predicts for the DV when the IV is 0

44

Slope (b)

Tells us the predicted change in the DV for a one-unit increase in the IV

45

Residual

The difference between an actual observed data point and the value predicted by a model

46

Least squares/OLS

A method for estimating a regression line by choosing the coefficients that minimize the sum of squared residuals (the squared differences between observed and predicted values)
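For a single predictor, the least squares slope and intercept have a closed form: slope = sum of (x - mean x)(y - mean y) divided by sum of (x - mean x)², and intercept = mean y - slope × mean x. A minimal sketch on made-up data:

```python
xs = [1, 2, 3, 4, 5]  # made-up IV values
ys = [2, 4, 5, 4, 5]  # made-up DV values

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least squares estimates for one predictor.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x

predicted = [intercept + slope * x for x in xs]
residuals = [y - p for y, p in zip(ys, predicted)]  # observed minus predicted
sum_squared_residuals = sum(r ** 2 for r in residuals)

print(f"y = {intercept:.1f} + {slope:.1f}x, SSR = {sum_squared_residuals:.2f}")
```

Any other line through these points would give a larger sum of squared residuals.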

47

R²

The proportion of the variation in the dependent variable (Y) that is explained by the regression model

48

Adjusted R²

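A version of R² that adjusts for the number of independent variables in the model. It only increases when an added variable improves the fit by more than would be expected by chance, so it penalizes adding unhelpful predictors.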
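R² and adjusted R² can be compared on a small example. A minimal sketch, assuming made-up observed values and fitted values from a hypothetical line y = 2.2 + 0.6x, and using the standard formula adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1), where k is the number of predictors:

```python
ys = [2, 4, 5, 4, 5]                    # made-up observed DV values
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]   # fitted values from y = 2.2 + 0.6x at x = 1..5

n = len(ys)  # number of observations
k = 1        # number of predictors in the model

mean_y = sum(ys) / n
ss_residual = sum((y - p) ** 2 for y, p in zip(ys, predicted))  # unexplained variation
ss_total = sum((y - mean_y) ** 2 for y in ys)                   # total variation in Y

r_squared = 1 - ss_residual / ss_total
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print(f"R² = {r_squared:.3f}, adjusted R² = {adjusted_r_squared:.3f}")
```

Adjusted R² is never larger than R², and the gap grows as predictors are added relative to the sample size.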

49

Control variable

50

Fixed effects

51

Interaction term

52

Multicollinearity

53

Heteroscedasticity

54

Nonlinearity

55

Dummy variable

A binary (0/1) variable used to represent categories in a regression model

56

Reference category

The omitted group when using dummy variables. All dummy variable coefficients are interpreted relative to this group
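Dummy coding with a reference category can be sketched in a few lines. The category names, the choice of "north" as the reference, and the `is_...` column names below are all made up for illustration.

```python
regions = ["north", "south", "west", "south", "north"]  # made-up categorical data
reference = "north"                                     # the omitted group

# One dummy per category except the reference.
categories = sorted(set(regions) - {reference})

dummies = [
    {f"is_{category}": int(region == category) for category in categories}
    for region in regions
]

for region, row in zip(regions, dummies):
    print(region, row)
```

Rows from the reference category have 0 on every dummy, which is why each dummy's coefficient is read as a difference from that group.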

57

Control variable

A variable included in a regression to account for other factors that may affect the outcome, helping isolate the relationship between the main independent variable and the dependent variable

58

Fixed Effects

Control for all unobserved, time-invariant characteristics of units by comparing each unit to itself over time
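One common way to implement this idea is the within (demeaning) transformation: subtract each unit's own average so that only changes over time remain. The card does not name a method, so treat this as one possible sketch on a made-up two-unit panel.

```python
from collections import defaultdict
from statistics import fmean

# Made-up panel: (unit, year, outcome), for illustration only.
panel = [
    ("A", 2019, 10.0),
    ("A", 2020, 12.0),
    ("B", 2019, 30.0),
    ("B", 2020, 33.0),
]

# Average outcome per unit.
outcomes_by_unit = defaultdict(list)
for unit, _, outcome in panel:
    outcomes_by_unit[unit].append(outcome)
unit_means = {unit: fmean(values) for unit, values in outcomes_by_unit.items()}

# Subtracting each unit's own mean removes anything constant within the unit.
demeaned = [(unit, year, outcome - unit_means[unit]) for unit, year, outcome in panel]

for row in demeaned:
    print(row)
```

Unit B's much higher level disappears after demeaning, so the comparison rests only on how each unit changed over time.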

59

Interaction term

Allows the effect of one independent variable on the dependent variable to depend on the value of another variable
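In a model of the form y = b0 + b1·x + b2·d + b3·(x·d), the interaction coefficient b3 changes the slope of x depending on d. A minimal sketch with made-up coefficients and a hypothetical `predict` function:

```python
def predict(x, d, b0=1.0, b1=2.0, b2=0.5, b3=1.5):
    """Prediction from a model with an interaction term x * d (made-up coefficients)."""
    return b0 + b1 * x + b2 * d + b3 * x * d

# Effect of a one-unit increase in x, separately for each value of d.
slope_when_d_is_0 = predict(1, 0) - predict(0, 0)  # b1
slope_when_d_is_1 = predict(1, 1) - predict(0, 1)  # b1 + b3

print(slope_when_d_is_0, slope_when_d_is_1)
```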

60

Multicollinearity

When two or more independent variables are highly correlated, making coefficient estimates unstable and standard errors large
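A common diagnostic is the variance inflation factor (VIF). With exactly two predictors it reduces to 1 / (1 - r²), where r is the correlation between them. The sketch below uses made-up predictors; the rule of thumb that a VIF above 10 signals a problem is a convention, not a strict cutoff.

```python
def r_squared_between(xs, ys):
    """Squared Pearson correlation between two predictors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov ** 2 / (vx * vy)

x1 = [1, 2, 3, 4, 5]
x2_similar = [1.1, 2.0, 2.9, 4.2, 5.0]  # nearly a copy of x1
x2_different = [3, 1, 4, 1, 5]          # only loosely related to x1

# Two-predictor VIF: how much the coefficient variance is inflated by the overlap.
vif_high = 1 / (1 - r_squared_between(x1, x2_similar))
vif_low = 1 / (1 - r_squared_between(x1, x2_different))

print(f"VIF with near-duplicate predictor: {vif_high:.0f}")
print(f"VIF with loosely related predictor: {vif_low:.2f}")
```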

61

Heteroscedasticity

Occurs when the variance of the regression errors is not constant across values of the independent variables, which leads to incorrect standard errors

62

Nonlinearity

When the relationship between the independent variable and the dependent variable is not linear, meaning a straight line is not an appropriate fit