Data Mining Quiz 1


30 Terms

1

Linear Regression

  • Predicts a quantitative response

  • Assumes a linear relationship between predictor variables and the response variable

2

Parametric method

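Assumes a functional form for the relationship (here, linear), then estimates its parameters from the data by minimizing the residual sum of squares (RSS)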

3

𝛽0

Intercept

Predicted value of Y when X = 0 (where the line crosses the Y-axis)

Unknown; estimated from the data

4

𝛽1

Slope

Unknown; estimated from the data

5

Simple Linear Regression

Predicts a quantitative (numeric) response Y using a single predictor (independent variable) X

6

^

"Hat" symbol: indicates an estimated or predicted value (e.g., Ŷ or 𝛽̂1)

7

ϵ

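Error term: the random deviation of the actual Y from the true (population) regression line. It is unobservable; the residual (actual Y minus predicted Y) is its estimate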

8

Estimating Residuals

Residual = Actual Y − Predicted Ŷ

9

Residual Sum of Squares

Sum of the squared residuals over all observations. We square the residuals so negatives don’t cancel out and to penalize large errors more heavily
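As a minimal sketch of cards 5 to 9, the snippet below fits a simple linear regression by least squares and computes the residuals and the RSS. The data values and the helper name `fit_simple` are made up for illustration and are not part of the original cards.

```python
def fit_simple(xs, ys):
    """Least-squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx              # estimated slope
    b0 = y_bar - b1 * x_bar     # estimated intercept
    return b0, b1


# Hypothetical data, made up for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = fit_simple(xs, ys)
predicted = [b0 + b1 * x for x in xs]
# Residual = actual Y minus predicted Y
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]
# RSS = sum of squared residuals
rss = sum(e ** 2 for e in residuals)
print(b0, b1, rss)
```

With an intercept in the model, the least-squares residuals sum to zero, which is one reason they are squared before being added up.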

10

Sample Mean

Average of the observed sample; measurable, and generally a good estimate of the population mean

11

Population Mean

Average among the entire population, not always measurable

12

Standard Error (SE)

Statistic that measures the uncertainty of using the sample mean to estimate the population mean

13

Confidence Intervals

Range of values that, with a stated level of confidence (e.g., 95%), contains the true unknown value of the parameter being estimated

Typically confidence level is 95%
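The standard error and confidence interval cards can be sketched together for the simplest case, estimating a population mean from a sample. The sample values are hypothetical, and the interval uses the rule of thumb "estimate ± 2 standard errors", which only approximates a 95% interval (for a sample this small the exact multiplier would be larger).

```python
import math
import statistics

sample = [4.0, 6.0, 8.0, 10.0, 12.0]  # hypothetical observations

sample_mean = statistics.mean(sample)
# Standard error of the mean: sample standard deviation / sqrt(n)
se = statistics.stdev(sample) / math.sqrt(len(sample))
# Approximate 95% confidence interval: estimate +/- 2 standard errors
lower, upper = sample_mean - 2 * se, sample_mean + 2 * se
print(sample_mean, se, (lower, upper))
```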

14

Null Hypothesis (H0)

No relationship between X and Y
H0 : 𝛽1 = 0

15

Alternative Hypothesis

There exists a relationship between X and Y
Ha : 𝛽1 ≠ 0

16

T-statistic

  • Measures how many standard errors the estimate 𝛽̂1 is away from 0

  • Essentially a ratio: t = 𝛽̂1 / SE(𝛽̂1)

  • The larger the ratio in absolute value, the further 𝛽̂1 is from 0
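A rough sketch of the t-statistic for the slope of a simple linear regression, using made-up data. It assumes the usual simple-regression formula SE(𝛽̂1)² = σ̂² / Σ(x − x̄)², with σ̂² = RSS / (n − 2); the variable names are illustrative.

```python
import math

# Hypothetical data, made up for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(xs)

x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
b0 = y_bar - b1 * x_bar

rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
sigma2_hat = rss / (n - 2)            # estimated error variance
se_b1 = math.sqrt(sigma2_hat / sxx)   # standard error of the slope
t_stat = b1 / se_b1                   # distance of the slope from 0, in standard errors
print(b1, se_b1, t_stat)
```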

17

P-value

  • Probability of observing a t-statistic at least as extreme (in absolute value) as the one computed, assuming the null hypothesis is true

  • The smaller the p-value, the less likely it is that the observed association between X and Y is due to chance alone

  • Can reject the null hypothesis (i.e., claim there is a relationship) if the p-value is small enough

  • Typically 5% or 1% cut off (stylized as p < 0.05 or p < 0.01)

18

Residual Standard Error (RSE)

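  • Estimate of the standard deviation of the error term ϵ: RSE = √(RSS / (n − p − 1)), i.e. √(RSS / (n − 2)) with a single predictor

  • Measures the model's “lack of fit”: the larger the RSE, the worse the fit

  • Measured in the units of Y (e.g., number of units sold)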

19

R squared statistic

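  • Proportion of the variance in Y explained by the model

  • Ranges over [0, 1]; 1 is a perfect fit

  • R² = 1 − RSS/TSS, where TSS (total sum of squares) is the total variability in Y around its mean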
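The RSE and R² cards can be sketched as one small helper. The function name `rse_and_r2`, the observed values, and the fitted values are all hypothetical; `p` is the number of predictors in the model that produced the fitted values.

```python
import math


def rse_and_r2(ys, fitted, p):
    """Return (RSE, R^2) for observed ys and fitted values from a model with p predictors."""
    n = len(ys)
    y_bar = sum(ys) / n
    rss = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # unexplained variation
    tss = sum((y - y_bar) ** 2 for y in ys)              # total variation in Y
    rse = math.sqrt(rss / (n - p - 1))                   # in the units of Y
    r2 = 1 - rss / tss                                   # proportion of variance explained
    return rse, r2


# Hypothetical observed values and fitted values from a one-predictor model.
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
fitted = [2.8, 3.4, 4.0, 4.6, 5.2]
rse, r2 = rse_and_r2(ys, fitted, p=1)
print(rse, r2)
```

RSE is an absolute measure in the units of Y, while R² is unitless, which is why the two are usually reported together.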

20

Coefficient interpretation

Average effect on Y of a one-unit increase in Xi, holding all other predictors fixed

21

Check all coefficients

  • H0 : All slope coefficients are zero

  • 𝛽1 = 𝛽2 = … = 𝛽p = 0

  • Ha : At least one of [𝛽1, 𝛽2, …, 𝛽p] is non-zero

22

Check subset of coefficients

  • H0 : All q coefficients in the chosen subset are zero (written here as the last q coefficients)

  • 𝛽_(p−q+1) = 𝛽_(p−q+2) = … = 𝛽_p = 0

  • Ha : At least one of [𝛽_(p−q+1), 𝛽_(p−q+2), …, 𝛽_p] is non-zero

23

F-Statistic

  • Close to 1: no evidence of a relationship between any of the predictors and Y

  • Far greater than 1: evidence that at least one of the predictors is related to Y
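A minimal sketch of the F-statistic for testing all p coefficients at once, assuming the standard formula F = ((TSS − RSS) / p) / (RSS / (n − p − 1)). The function name and the input numbers are hypothetical.

```python
def f_statistic(tss, rss, n, p):
    """F-statistic for H0: all p slope coefficients are zero."""
    explained_per_predictor = (tss - rss) / p
    unexplained_per_df = rss / (n - p - 1)
    return explained_per_predictor / unexplained_per_df


# Hypothetical sums of squares from a one-predictor model fit to 5 observations.
f_stat = f_statistic(tss=6.0, rss=2.4, n=5, p=1)
print(f_stat)
```

How far above 1 the value must be before rejecting H0 depends on n and p, so in practice the decision is made from the p-value associated with the F-statistic.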

24

p >> n

Too many coefficients to estimate, not enough samples

  • You want to predict exam scores (Y) from 100 predictors (study time, sleep hours, diet habits, stress levels, etc.).

  • But you only have 10 students’ data (n = 10).

Here, p = 100 and n = 10. Since p > n, the model has more parameters than observations, so they cannot be reliably estimated.

25

Forward Selection

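  • Begin with the null model (intercept 𝛽0 only). Fit a simple linear regression for each predictor and add the predictor that gives the lowest RSS; the model now has two coefficients. Then repeatedly add whichever remaining predictor gives the lowest RSS for the enlarged model, until a stopping rule is met.

  • Can always be used, even when p > n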
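The forward selection procedure can be sketched in plain Python as below. This is an illustrative sketch, not a reference implementation: the helper names (`solve`, `rss_of_fit`, `forward_selection`), the data, and the stopping rule (a fixed maximum number of predictors, `max_predictors`) are all assumptions. Least squares is solved through the normal equations, which is fine for a tiny example but not numerically robust in general.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def rss_of_fit(columns, ys):
    """RSS of the least-squares fit of ys on an intercept plus the given predictor columns."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(ys))]
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((y - sum(b * v for b, v in zip(beta, r))) ** 2 for r, y in zip(rows, ys))


def forward_selection(predictors, ys, max_predictors):
    """Starting from the null model, greedily add the predictor that gives the lowest RSS."""
    selected = []
    while len(selected) < max_predictors:
        remaining = [name for name in predictors if name not in selected]
        if not remaining:
            break
        best = min(
            remaining,
            key=lambda name: rss_of_fit([predictors[s] for s in selected + [name]], ys),
        )
        selected.append(best)
    return selected


# Hypothetical data: ys depends on x1 and x2 only; x3 is irrelevant.
predictors = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    "x3": [3.0, 1.0, 2.0, 3.0, 1.0, 2.0],
}
ys = [6.0, 5.0, 10.0, 9.0, 14.0, 13.0]  # equals 1 + 2*x1 + 3*x2

print(forward_selection(predictors, ys, max_predictors=2))
```

Because the procedure is greedy, a predictor added early is never removed, which is the weakness that mixed selection addresses.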

26

Backward Selection

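  • Begin with the model containing all predictors. Remove predictors one at a time, each time dropping the one with the largest p-value and refitting, until a stopping rule is met

  • Cannot be used when p > n (including p >> n), because the full model cannot be fit by least squares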

27

Mixed Selection

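  • Combination of forward and backward selection: add predictors as in forward selection, and remove any predictor whose p-value becomes too large after an addition

  • Stop when all predictors in the model have a low p-value and every predictor outside the model would have a high p-value if added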

28

Qualitative Variables

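Predictors with two levels (binary)

○ Create a single dummy (indicator) variable, coded 0/1, that captures which level applies

○ This is one-hot encoding with one of the two columns dropped

Predictors with more than two levels, say n levels (categorical)

○ Create n-1 dummy variables

○ Select one level to be the “neutral” baseline, which gets no dummy variable of its own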
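A small sketch of the n-1 dummy variable encoding described above. The helper name `dummy_encode`, the `is_` column prefix, and the example levels are hypothetical.

```python
def dummy_encode(values, baseline):
    """Encode a qualitative predictor as 0/1 dummy columns, one per level except the baseline."""
    levels = sorted(set(values))
    if baseline not in levels:
        raise ValueError("baseline must be one of the observed levels")
    return {
        f"is_{level}": [1 if v == level else 0 for v in values]
        for level in levels
        if level != baseline
    }


# Hypothetical three-level predictor; "east" is chosen as the baseline level.
regions = ["east", "west", "south", "east", "south"]
dummies = dummy_encode(regions, baseline="east")
print(dummies)
```

An observation at the baseline level has 0 in every dummy column, so each dummy's coefficient is interpreted as the average difference from the baseline.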

29

Additive Assumption

○ The association between a predictor Xj and the response Y does not depend on the values of the other predictors

○ Can relax the additive assumption by adding interaction terms (e.g., X1 × X2), which capture “synergy” between predictors

30

Linear Assumption

○ The change in the response Y associated with a one-unit change in Xj is constant, regardless of the value of Xj

○ Can relax the linear assumption by adding polynomial terms

○ Still technically a linear model, because it is linear in the coefficients, but the fitted curve can have a quadratic shape
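To illustrate both assumptions, the sketch below builds a feature row that includes an interaction term (x1 × x2) and a squared term (x1²). The prediction is still a plain dot product with the coefficients, so the model remains linear in the coefficients. The coefficient values and function names are hypothetical.

```python
def design_row(x1, x2):
    """Feature vector: intercept, x1, x2, interaction term, and squared term."""
    return [1.0, x1, x2, x1 * x2, x1 ** 2]


def predict(beta, row):
    """Prediction is a dot product, i.e. linear in the coefficients beta."""
    return sum(b * v for b, v in zip(beta, row))


# Hypothetical coefficients: intercept, x1, x2, x1*x2, x1**2.
beta = [1.0, 2.0, 0.5, 0.25, -0.1]

# Effect on the prediction of raising x1 from 1 to 2, at two different values of x2.
effect_low = predict(beta, design_row(2.0, 0.0)) - predict(beta, design_row(1.0, 0.0))
effect_high = predict(beta, design_row(2.0, 10.0)) - predict(beta, design_row(1.0, 10.0))
print(effect_low, effect_high)
```

The two effects differ because of the interaction term, so the association between x1 and the response now depends on x2, which is exactly what the additive assumption rules out.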