Gen Bus 307 (Linear Regression)


34 Terms

1
New cards

Supervised Learning

Groups together methods that attempt to learn about the conditional distribution of Y given X

2
New cards

Conditional Distribution equation

P(Y | X) = P(y1, y2, ... | x1, x2, ...)

3
New cards

Y variables

The outcomes, response variables, or labels.
In the case of randomized experiments, also called dependent variables.

4
New cards

X variables

The explanatory variables, predictors, or regressors.
In the case of randomized experiments, also called independent variables.

5
New cards
6
New cards

The linear regression method is based on:

  • The relationship between the expected value of Y and Xs is assumed to be linear

  • Estimates and predictions are denoted with a hat

  • The coefficients are obtained by minimizing the Sum of Squared Residuals (or “errors”)

7
New cards

expected value equation

E(Y | X) = β0 + β1 X1 + β2 X2 + ⋯ + βn Xn

8
New cards

Estimates equation

ŷ = β̂0 + β̂1 X1 + β̂2 X2 + ⋯ + β̂n Xn

9
New cards

Sum of Squared Residuals Equation

SSErr = ∑ (Yi − ŷi)^2, summed over all observations i
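
As a rough illustration of minimizing the sum of squared residuals, the sketch below fits a line with a single explanatory variable using the standard closed-form least-squares solution. The data and the function names (fit_simple_ols, sse) are hypothetical and do not come from the course material; a model with several Xs would need a matrix solver instead.

```python
# Hypothetical toy data (not from the source): hours studied and grades.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
grades = [62.0, 68.0, 71.0, 79.0, 83.0]


def fit_simple_ols(x, y):
    """Return (b0, b1) minimizing the sum of squared residuals for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sse(x, y, b0, b1):
    """Sum of squared residuals for the line b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


b0_hat, b1_hat = fit_simple_ols(hours, grades)
best_sse = sse(hours, grades, b0_hat, b1_hat)

# Any other line gives a larger sum of squared residuals.
worse_sse = sse(hours, grades, b0_hat + 1.0, b1_hat)
print(b0_hat, b1_hat, best_sse, worse_sse)
```

Moving either coefficient away from the fitted values can only increase the sum of squared residuals, which is what "minimizing" means here.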

10
New cards

grade = 57 + 5.2 × StudyTimeH − 8.7 × ClassSkipped


• Intercept: Expected ŷ when every X = 0

“On average, when a student spent 0 hours studying and skipped 0 classes, we expect their grade to be 57 points, everything else being equal.”

• Slope: Expected change in ŷ for a one-unit change in one X, while every other X stays the same

“On average, an increase in study time by 1 hour is associated with an increase in grade by 5.2 points, everything else being equal.”
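
The interpretations can be checked numerically. This is a small sketch that reads the fitted equation as grade = 57 + 5.2 × StudyTimeH − 8.7 × ClassSkipped; the function name predicted_grade and the example inputs (3 or 4 hours, 2 or 3 classes skipped) are illustrative assumptions.

```python
def predicted_grade(study_time_h, classes_skipped):
    """Predicted grade from the fitted equation quoted on the card."""
    return 57 + 5.2 * study_time_h - 8.7 * classes_skipped


# Intercept: prediction when every X is 0.
intercept = predicted_grade(0, 0)

# Slopes: change one X by one unit while holding the other fixed.
study_slope = predicted_grade(4, 2) - predicted_grade(3, 2)
skip_slope = predicted_grade(3, 3) - predicted_grade(3, 2)

print(intercept, study_slope, skip_slope)
```

The two differences recover the coefficients 5.2 and −8.7 (up to floating-point rounding), whatever starting values are chosen, because the model is linear.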

11
New cards

If X = 0 makes no sense or is outside the range of the data (out-of-domain) for at least one explanatory variable

Do not interpret the intercept. Depending on the case, state:

• “A study time of 0 is not in the range of our data and we shouldn’t extrapolate”
• “A study time of 0 is unrealistic and we shouldn’t extrapolate”
• “A study time of 0 is not in the range of our data and unrealistic and we shouldn’t extrapolate”

12
New cards

Slope

Expected change in ŷ for a one-unit change in the corresponding X, while every other X stays the same

13
New cards

𝐿𝑖𝑓𝑒 𝐸𝑥𝑝. = −2.92 + 8.1 × ln 𝐺𝐷𝑃

On average, an increase of GDP by 1% is associated with an increase in Life Expectancy by 0.081 years, everything else being equal.

14
New cards

ln(Life Exp.) = 1.23 + 0.02 × GDP

On average, an increase of GDP by 1 million USD is associated with an increase in Life Expectancy by about 2%, everything else being equal.

15
New cards

ln(Life Exp.) = 1.23 + 2.1 × ln GDP

On average, an increase of GDP by 1% is associated with an increase in Life Expectancy by about 2.1%, everything else being equal.
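
The three log interpretations (level-log, log-level, log-log) are approximations that work well for small changes. The sketch below compares them with the exact changes implied by the coefficients 8.1, 0.02, and 2.1 from the three Life Expectancy equations; the variable names are illustrative.

```python
import math

# Level-log: Life Exp. = -2.92 + 8.1 * ln(GDP).
# A 1% increase in GDP changes Life Exp. by 8.1 * ln(1.01) years.
level_log_change = 8.1 * math.log(1.01)

# Log-level: ln(Life Exp.) = 1.23 + 0.02 * GDP.
# One more unit of GDP multiplies Life Exp. by exp(0.02).
log_level_pct = (math.exp(0.02) - 1) * 100

# Log-log: ln(Life Exp.) = 1.23 + 2.1 * ln(GDP).
# A 1% increase in GDP multiplies Life Exp. by 1.01 ** 2.1.
log_log_pct = (1.01 ** 2.1 - 1) * 100

print(round(level_log_change, 4), round(log_level_pct, 3), round(log_log_pct, 3))
```

The exact values land close to the rule-of-thumb figures of 0.081 years, 2%, and 2.1%; the gap grows as the change in X gets larger.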

16
New cards

If a variable is standardized (mean 0 and s.dev. 1)

the change is in standard deviations
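
A minimal sketch of standardizing a variable, using hypothetical values and the sample standard deviation from Python's statistics module:

```python
import statistics

# Hypothetical values (not from the source).
values = [1.2, 3.4, 2.2, 5.0, 4.1]

mean = statistics.mean(values)
sd = statistics.stdev(values)  # sample standard deviation

# Subtract the mean and divide by the standard deviation.
standardized = [(v - mean) / sd for v in values]

print(statistics.mean(standardized), statistics.stdev(standardized))
```

After the transformation, one unit of the variable corresponds to one standard deviation of the original values.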

17
New cards

Std Life Exp = 1.23 + 0.1 × GDP

On average, an increase of GDP by 1 million USD is associated with an increase in Life Expectancy by 0.1 standard deviations, everything else being equal.

18
New cards

R^2

The share of the variation in Y that we can explain with the model, when we know the value of every explanatory variable

19
New cards

Example: R^2 = 0.245

“With this model, we can explain 24.5% of the variation in grades by looking at the variation in both the number of hours of study and the number of classes skipped.”
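
One common way to compute this share is R^2 = 1 − SSErr / SSTotal, where SSTotal is the sum of squared deviations of Y from its mean. The sketch below applies that formula to hypothetical observed and fitted values; none of the numbers come from the grade model.

```python
# Hypothetical observed values and fitted values from some regression.
y = [62.0, 68.0, 71.0, 79.0, 83.0]
y_hat = [62.0, 67.3, 72.6, 77.9, 83.2]

y_bar = sum(y) / len(y)

# Variation left unexplained by the model.
ss_err = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))

# Total variation of Y around its mean.
ss_total = sum((yi - y_bar) ** 2 for yi in y)

r_squared = 1 - ss_err / ss_total
print(round(r_squared, 4))
```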

20
New cards

p-value

The probability of finding an estimated coefficient at least that far from the population value, if the population value were the one stated in H0 (usually 0).

21
New cards

Coefficient      P-value (H0: β = 0)
Intercept        < 0.001
StudyTimeH       < 0.001
ClassSkipped     0.042

“If the true population coefficient βStudyTimeH = 0, there is a probability of less than 0.1% that the estimated coefficient for the Study Time in hours is at least that far from 0.”

“If the true population coefficient βClassSkipped = 0, there is a probability of 4.2% that the estimated coefficient for the number of classes skipped is at least that far from 0.”

22
New cards

If p-value < α

The result is statistically significant at significance level α (that is, at confidence level 1 − α)
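
The decision rule can be sketched with the p-values reported for the grade model (Intercept < 0.001, StudyTimeH < 0.001, ClassSkipped 0.042). The thresholds 0.05 and 0.01 are common choices assumed here for illustration, and each "< 0.001" entry is represented by its upper bound.

```python
# p-values from the grade model; "< 0.001" is stored as its upper bound.
p_values = {"Intercept": 0.001, "StudyTimeH": 0.001, "ClassSkipped": 0.042}


def is_significant(p_value, alpha):
    """A result is statistically significant at level alpha if p-value < alpha."""
    return p_value < alpha


significant_at_5pct = {name: is_significant(p, 0.05) for name, p in p_values.items()}
significant_at_1pct = {name: is_significant(p, 0.01) for name, p in p_values.items()}

print(significant_at_5pct)
print(significant_at_1pct)
```

With these inputs, ClassSkipped is significant at the 5% level but not at the 1% level, while the other two coefficients pass both thresholds.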

23
New cards

𝑅𝑒𝑣 = 11,671,521,696 − 5,816,822 × 𝑌𝑒𝑎𝑟 + 3 × 𝐵𝑢𝑑𝑔𝑒𝑡

• What is the expected revenue for a film released in 1992 without a budget?

A film without a budget is unrealistic, so we should not extrapolate.


• What is the expected revenue for a film produced in 1970 with a $10MM budget?

11,671,521,696 − 5,816,822 × 1970 + 3 × 10,000,000 = 242,382,356
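
The same prediction can be reproduced in a few lines. The function name predicted_revenue is an illustrative choice; the coefficients are the ones in the revenue equation.

```python
def predicted_revenue(year, budget):
    """Expected revenue from the fitted equation: intercept, year slope, budget slope."""
    return 11_671_521_696 - 5_816_822 * year + 3 * budget


# Film produced in 1970 with a $10 million budget.
revenue_1970 = predicted_revenue(1970, 10_000_000)
print(f"{revenue_1970:,}")  # 242,382,356
```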

24
New cards

Incremental Value Isn’t Constant

  • In linear models, a one-unit change in X results in the same change in Y, whatever the starting value of X.

  • In non-linear models, the effect of changing X depends on where you start. For example, with the quadratic function Y = X^2, increasing X from 1 to 2 raises Y by 3, while increasing X from 5 to 6 raises Y by 11.
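
A short sketch of the contrast, using Y = X^2 and, as an assumed linear comparison, Y = 2 + 3X:

```python
def quadratic(x):
    return x ** 2


def linear(x):
    # Hypothetical linear model used only for comparison.
    return 2 + 3 * x


# Effect of a one-unit increase in X at two different starting points.
quad_low = quadratic(2) - quadratic(1)
quad_high = quadratic(6) - quadratic(5)
lin_low = linear(2) - linear(1)
lin_high = linear(6) - linear(5)

print(quad_low, quad_high, lin_low, lin_high)  # 3 11 3 3
```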

25
New cards

Local Incremental Changes Matter

  • Since the effect of X is not uniform, we need to consider local changes: the impact of a small increase in X at a specific point.

  • This is crucial in economics, business, and data analysis, where small changes can have different effects depending on the situation.

26
New cards

Derivative equation

ŷ = β̂0 + β̂1 X1 + β̂2 log(X1) + β̂3 X1^2 + β̂4 X1 X2 + β̂5 X2

27
New cards

Derivative of a sum is the sum of the derivatives


∂ŷ/∂X1 = ∂β̂0/∂X1 + ∂(β̂1 X1)/∂X1 + ∂(β̂2 log(X1))/∂X1 + …
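
Differentiating a model of the form ŷ = b0 + b1 X1 + b2 log(X1) + b3 X1^2 + b4 X1 X2 + b5 X2 term by term gives the local effect b1 + b2 / X1 + 2 b3 X1 + b4 X2, assuming log is the natural logarithm. The sketch below checks that result against a numerical derivative; the coefficient values and the evaluation point are hypothetical.

```python
import math

# Hypothetical coefficient estimates b0..b5 (not from the source).
b0, b1, b2, b3, b4, b5 = 1.0, 0.5, 2.0, 0.1, 0.3, -0.4


def y_hat(x1, x2):
    """Fitted value; log is assumed to be the natural logarithm."""
    return b0 + b1 * x1 + b2 * math.log(x1) + b3 * x1 ** 2 + b4 * x1 * x2 + b5 * x2


def analytic_effect(x1, x2):
    """Sum of the term-by-term derivatives with respect to x1."""
    return b1 + b2 / x1 + 2 * b3 * x1 + b4 * x2


def numeric_effect(x1, x2, h=1e-6):
    """Central-difference approximation of the same derivative."""
    return (y_hat(x1 + h, x2) - y_hat(x1 - h, x2)) / (2 * h)


effect_exact = analytic_effect(2.0, 3.0)
effect_approx = numeric_effect(2.0, 3.0)
print(effect_exact, effect_approx)
```

The terms that do not involve X1 (the intercept and b5 X2) drop out, and the result depends on both X1 and X2, so the incremental effect differs from point to point.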

28
New cards

For all values of 𝑋 the population errors

• Have mean zero (Linearity)
• Are statistically independent (Independence)
• Are normally distributed (Normality)
• Have equal variance (Equal Variance)

Can be remembered using the acronym L.I.N.E.

29
New cards

Even though we can’t know the population errors

we can guess their behavior by looking at the sample residuals

30
New cards

Mean Zero Population Errors (Linearity)

• Population errors are assumed to have mean zero for any given values of the Xs.
• There should be roughly as many points below and above the straight line.

31
New cards

Independent Population Errors

• Knowing the value of the errors for any (set of) X value(s) provides no information on the value of the other errors

  • No relation between X and e

  • No relation between different e


• Commonly violated with time series data

  • Errors may display trends over time called autocorrelation
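
One simple diagnostic, offered here as an assumption rather than the course's prescribed method, is the lag-1 autocorrelation of the residuals taken in time order: values near 0 are consistent with independence, while values far from 0 suggest autocorrelation. Both residual series below are made up for illustration.

```python
def lag1_autocorrelation(residuals):
    """Correlation between each residual and the one just before it."""
    n = len(residuals)
    mean = sum(residuals) / n
    centered = [e - mean for e in residuals]
    numerator = sum(centered[t] * centered[t - 1] for t in range(1, n))
    denominator = sum(c ** 2 for c in centered)
    return numerator / denominator


# Hypothetical residuals in time order.
trending = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]   # drifts upward over time
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]     # flips sign every period

r_trending = lag1_autocorrelation(trending)
r_alternating = lag1_autocorrelation(alternating)
print(round(r_trending, 3), round(r_alternating, 3))
```

The trending series gives a clearly positive value and the alternating series a clearly negative one; both patterns mean that one residual helps predict the next.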

32
New cards

Normally Distributed Population Errors

• The distribution of the errors is normal


• Note: This is not the same as saying that Y is normally distributed


• Check by plotting the residuals

  • Histogram

  • QQ Plot


• Histogram should be (roughly) bell shaped

33
New cards

Equal Error Variance

• The variance of the errors does not depend on the values of 𝑋

  • Variance of 𝑒 constant across 𝑋 ➔ homoskedasticity

  • Variance of 𝑒 depends on 𝑋 ➔ heteroskedasticity


• Can be seen by plotting e against X

  • (Relatively) consistent ➔ homoskedastic

  • Fan shape ➔ heteroskedastic
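
Besides the plot, a rough numerical check is to compare the residual variance for low and high values of X. The sketch below does this on simulated residuals; the simulation settings, the median split, and the function name variance_ratio are all assumptions made for illustration, not a formal test.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the simulation is repeatable

x = [rng.uniform(0.0, 10.0) for _ in range(2000)]

# Simulated residuals: constant spread vs. spread that grows with X (fan shape).
homoskedastic = [rng.gauss(0.0, 2.0) for _ in x]
heteroskedastic = [rng.gauss(0.0, 0.5 * xi) for xi in x]


def variance_ratio(x_values, residuals):
    """Residual variance above the median of X divided by the variance below it."""
    cutoff = statistics.median(x_values)
    low = [e for xi, e in zip(x_values, residuals) if xi <= cutoff]
    high = [e for xi, e in zip(x_values, residuals) if xi > cutoff]
    return statistics.pvariance(high) / statistics.pvariance(low)


ratio_homo = variance_ratio(x, homoskedastic)
ratio_hetero = variance_ratio(x, heteroskedastic)
print(ratio_homo, ratio_hetero)
```

A ratio near 1 is what a relatively consistent spread looks like; a ratio well above 1 matches the fan shape.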

34
New cards

If assumptions are violated

• Bias: Estimates and predictions might not be equal to the true value on average.

• Wrong uncertainty estimation: The standard errors could be off.

• Inefficient: Even if there were neither bias nor wrong standard errors, there might exist a more accurate method to perform estimation and prediction.