Linear Regression
Predicts a quantitative response
Assumes a linear relationship between predictor variables and the response variable
Parametric method
Estimates the parameters by minimizing the residual sum of squares (RSS)
𝛽0
Intercept
Where the fitted line crosses the y-axis: the expected value of Y when X = 0
Unknown; estimated from the data
𝛽1
Slope
Unknown; estimated from the data
Simple Linear Regression
Predicts a quantitative (numeric) response Y using a single predictor (independent variable) X
^
Indicates an estimated or predicted value (e.g., $\hat{y}$ is the predicted response)
ϵ
Error term: the part of Y that the linear model cannot explain
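Putting these pieces together, the simple linear regression model and its fitted counterpart are:

$$Y = \beta_0 + \beta_1 X + \epsilon, \qquad \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$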
Estimating Residuals
Residual = actual Y − predicted $\hat{Y}$, i.e. $e_i = y_i - \hat{y}_i$
Residual Sum of Squares
We square the residuals so negatives don’t cancel out and to penalize large errors more heavily
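Written out, the residual sum of squares is:

$$\mathrm{RSS} = e_1^2 + e_2^2 + \cdots + e_n^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$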
Sample Mean
Average of the observed sample; directly computable, and generally a good estimate of the population mean
Population Mean
Average over the entire population; usually not directly measurable
Standard Error (SE)
Statistic that measures the uncertainty of using the sample mean to estimate the population mean
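For the sample mean, the standard formula (assuming the n observations are uncorrelated) is:

$$\mathrm{SE}(\hat{\mu})^2 = \frac{\sigma^2}{n}$$

where σ is the standard deviation of each observation; the uncertainty shrinks as n grows.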
Confidence Intervals
Range of values that, with a stated level of confidence, contains the true (unknown) value of the parameter
Typically the confidence level is 95%
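For the slope, the approximate 95% confidence interval takes the standard form:

$$\hat{\beta}_1 \pm 2 \cdot \mathrm{SE}(\hat{\beta}_1)$$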
Null Hypothesis (H0)
No relationship between X and Y
$H_0 : \beta_1 = 0$
Alternative Hypothesis
There exists a relationship between X and Y
$H_a : \beta_1 \neq 0$
T-statistic
Measures the number of standard deviations $\hat{\beta}_1$ is away from 0
Essentially a ratio: the estimate divided by its standard error
The larger the ratio, the further the estimate is from 0 (stronger evidence against $H_0$)
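Concretely, the standard form is:

$$t = \frac{\hat{\beta}_1 - 0}{\mathrm{SE}(\hat{\beta}_1)}$$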
P-value
Probability of observing a t-statistic at least as extreme as the one computed, assuming $H_0$ is true
The smaller the p-value, the less likely it is that the observed association between X and Y occurred by chance
Can reject the null hypothesis (i.e., claim there is a relationship) if the p-value is small enough
Typical cutoffs are 5% or 1% (written p < 0.05 or p < 0.01)
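As a minimal sketch, here is how these quantities can be read off a fitted model with statsmodels (one common Python choice; the data below is synthetic and the variable names are placeholders):

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data: true beta0 = 2.0, true beta1 = 0.5
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 0.5 * x + rng.normal(size=100)

X = sm.add_constant(x)   # adds the intercept column (beta0)
fit = sm.OLS(y, X).fit()

print(fit.params)    # [beta0_hat, beta1_hat]
print(fit.tvalues)   # t-statistic for each coefficient
print(fit.pvalues)   # p-values; small => reject H0: beta = 0
```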
Residual Standard Error (RSE)
Estimate of the standard deviation of the error term ϵ
A measure of the model's "lack of fit"
Expressed in the units of Y (e.g., number of units sold)
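For simple linear regression the standard formula is (with p predictors, the denominator becomes $n - p - 1$):

$$\mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n - 2}}$$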
R squared statistic
Proportion of variance explained
[0,1], 1 is perfect fit
TSS (total sum of squares): the total variance in Y before the regression is fit
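In formulas:

$$R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}, \qquad \mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$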
Coefficient interpretation
Average effect on Y of a one-unit increase in $X_j$, holding all other predictors fixed
Check all coefficients
$H_0$ : all coefficients are zero, i.e. $\beta_1 = \beta_2 = \cdots = \beta_p = 0$
$H_a$ : at least one of $\beta_1, \beta_2, \ldots, \beta_p$ is non-zero
Check subset of coefficients
$H_0$ : a particular subset of q coefficients is zero, i.e. $\beta_{p-q+1} = \beta_{p-q+2} = \cdots = \beta_p = 0$
$H_a$ : at least one of $\beta_{p-q+1}, \beta_{p-q+2}, \ldots, \beta_p$ is non-zero
F-Statistic
F close to 1: no evidence of a relationship between the predictors and Y
F far greater than 1: at least one of the predictors is related to Y
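For the all-coefficients test above, the standard formula is:

$$F = \frac{(\mathrm{TSS} - \mathrm{RSS})/p}{\mathrm{RSS}/(n - p - 1)}$$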
p >> n
Too many coefficients to estimate, not enough samples
Example: you want to predict exam scores (Y) from 100 predictors (study time, sleep hours, diet habits, stress levels, etc.), but you only have data on 10 students. Here p = 100 and n = 10; since p > n, the model has too many parameters and not enough data to estimate them reliably.
Forward Selection
Begin with the null model (intercept $\beta_0$ only). Fit a simple linear regression for each predictor and add only the predictor with the lowest RSS; the model now has two coefficients. Continue until a stopping rule is met.
Can always be used
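A minimal sketch of forward selection, assuming scikit-learn and a fixed-size stopping rule (the function name and both array names are placeholders):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def forward_selection(X, y, max_features):
    """Greedily add the predictor that lowers RSS the most."""
    remaining = list(range(X.shape[1]))
    selected = []
    for _ in range(max_features):          # stopping rule: fixed model size
        best_rss, best_j = np.inf, None
        for j in remaining:
            cols = selected + [j]
            pred = LinearRegression().fit(X[:, cols], y).predict(X[:, cols])
            rss = np.sum((y - pred) ** 2)  # residual sum of squares
            if rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```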
Backward Selection
Begin with the model containing all predictors/coefficients. Remove the predictor with the largest p-value, refit, and repeat
Cannot be used when p > n (the full model cannot be fit)
Mixed Selection
Combination of the two
Continue until every predictor in the model has a low p-value, and every predictor outside the model would have a high p-value if added
Qualitative Variables
Predictors with two levels (binary)
○ Create a new dummy variable that captures information
○ One hot encoding
Predictors with more than two levels/n levels (categorical)
○ Create n-1 dummy variables
○ Select one class to be the baseline (the level that gets no dummy variable)
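A minimal sketch of creating n−1 dummy variables with pandas (the column name and levels are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"region": ["East", "West", "South", "East"]})

# drop_first=True keeps n-1 dummies; the dropped level ("East" here)
# becomes the baseline class absorbed into the intercept.
dummies = pd.get_dummies(df["region"], drop_first=True)
print(dummies)   # columns: South, West (indicator variables)
```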
Additive Assumption
○ The association between a predictor Xj and the response Y does not depend on the values of the other predictors
○ Can relax the additive assumption by adding interaction terms ("synergy"); see the sketch below
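A minimal sketch of adding an interaction term with statsmodels' formula API (the file and column names y, x1, x2 are placeholders):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("data.csv")   # assumed to contain columns y, x1, x2

# 'x1 * x2' expands to x1 + x2 + x1:x2, where x1:x2 is the interaction term
fit = smf.ols("y ~ x1 * x2", data=df).fit()
print(fit.summary())
```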
Linear Assumption
○ The change in the response Y associated with a one-unit change in Xj is constant, regardless of the value of Xj
○ Can relax the linear assumption by adding polynomial terms
○ Still technically a linear model (linear in the coefficients), but the fitted curve has a quadratic shape
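A minimal sketch of adding a quadratic term with the same formula API (file and column names again placeholders):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("data.csv")   # assumed to contain columns y and x

# I(x**2) adds a squared term; the model stays linear in beta0, beta1, beta2
fit = smf.ols("y ~ x + I(x**2)", data=df).fit()
print(fit.summary())
```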