Exam 3 Practice

ABOUT EXAM #3

• In class on Thursday, Apr. 30

• Paper test

• No computer, no notes, no phones

• You may use a calculator that cannot access the internet (only to assist with computations), but you do not need one.

• All you need to bring is a writing utensil.

• Questions will be similar to HW3

REGRESSION AND CORRELATION

Simple Linear Regression Model: 𝑌 = 𝛽0 + 𝛽1𝑥 + 𝜖, where

• 𝛽0 = y-intercept

• 𝛽1 = slope

• 𝜖 = random error (often assumed to be normal with mean = 0 and unknown standard deviation = 𝜎).

Least squares line (“fitted equation”): 𝑌̂ = 𝛽̂0 + 𝛽̂1𝑥, where

• 𝛽̂0 = 𝑦̅ − 𝛽̂1𝑥̅ = Avg. y − (slope)(Avg. x)

• 𝛽̂1 = 𝑟 · (𝑠𝑦/𝑠𝑥) = (correlation)(SD of y / SD of x)
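
The two formulas above can be checked numerically. The sketch below computes the sample correlation and standard deviations from raw data and then applies slope = r · (s_y/s_x) and intercept = ȳ − (slope)(x̄). The function name `least_squares` and the five data points are made up for illustration; they are not from the course.

```python
from math import sqrt


def least_squares(xs, ys):
    """Return (intercept, slope) using slope = r * s_y / s_x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divide by n - 1).
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # Sample correlation coefficient.
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    slope = r * s_y / s_x
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
intercept, slope = least_squares(xs, ys)
print(f"y-hat = {intercept:.2f} + {slope:.2f}x")
```

For these made-up points the fitted line is y-hat = 2.20 + 0.60x.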

Multiple Linear Regression:

• Model: 𝑌 = 𝛽0 + 𝛽1𝑥1 + 𝛽2𝑥2 + ⋯ + 𝛽𝑘𝑥𝑘 + 𝜖

• Fitted equation: 𝑌̂ = 𝛽̂0 + 𝛽̂1𝑥1 + 𝛽̂2𝑥2 + ⋯ + 𝛽̂𝑘𝑥𝑘

Residual = actual value − fitted value = 𝑦 − 𝑦̂

Regression Hypothesis Test

Is 𝑥𝑖 useful for predicting y (in a model that contains the other x variables)?

• Test H0: 𝛽𝑖 = 0 (𝑥𝑖 is NOT useful) vs. HA: 𝛽𝑖 ≠ 0 (𝑥𝑖 is useful)

• t-tests shown by R on the regression summary

Is the model useful for predicting y?

• Test H0: 𝛽1 = 𝛽2 = ... = 𝛽𝑘 = 0 (model is NOT useful) vs. HA: at least one 𝛽𝑖 ≠ 0 (model is useful)

• F-test shown by R on the regression summary

In the case of simple linear regression, the t-test for utility of x and the F-test for model utility are equivalent.
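
One way to see this equivalence: in simple linear regression the F statistic equals the square of the slope's t statistic, so both tests give the same p-value. The sketch below computes both statistics by hand. The function name `slr_t_and_f` and the data are hypothetical, chosen only to illustrate the identity.

```python
from math import sqrt


def slr_t_and_f(xs, ys):
    """Return (t statistic for the slope, F statistic) for simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * x for x in xs]
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # residual sum of squares
    ssr = sum((f - y_bar) ** 2 for f in fitted)          # regression sum of squares
    mse = sse / (n - 2)                                  # estimate of sigma^2
    t_stat = slope / sqrt(mse / sxx)                     # slope / (standard error of slope)
    f_stat = (ssr / 1) / mse                             # 1 predictor in the model
    return t_stat, f_stat


# Hypothetical data, for illustration only.
t_stat, f_stat = slr_t_and_f([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"t = {t_stat:.4f}, t^2 = {t_stat ** 2:.4f}, F = {f_stat:.4f}")
```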

BAYES RULE

D = Has condition (or disease, etc.)

DC = Does NOT have the condition (the complement of D)

T = Tests positive (test says has the condition)

TC = Tests negative (test says does not have the condition)

Prior probabilities

• P(D) = prevalence

• P(DC) = 1 – P(D)

Conditional:

• P(T|D) = true positive rate = sensitivity

• P(TC|D) = 1 − P(T|D) = false negative rate

• P(T|DC) = false positive rate

• P(TC|DC) = 1 − P(T|DC) = true negative rate = specificity

Intersections

• P(T and D) = P(T|D)P(D)

• P(T and DC) = P(T|DC)P(DC)

• P(T) = P(T and D) + P(T and DC)

Bayes Rule

• P(D|T) = P(T and D) / P(T) = P(T and D) / [P(T and D) + P(T and DC)]
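
The steps above (intersections, then P(T), then P(D|T)) can be sketched as one small function. The name `posterior_given_positive` and the example inputs are assumptions for illustration, not values from the course.

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """Return P(D|T) from P(D), P(T|D), and P(TC|DC) using Bayes' rule."""
    p_t_and_d = sensitivity * prevalence                 # P(T and D) = P(T|D)P(D)
    p_t_and_dc = (1 - specificity) * (1 - prevalence)    # P(T and DC) = P(T|DC)P(DC)
    p_t = p_t_and_d + p_t_and_dc                         # P(T)
    return p_t_and_d / p_t


# Hypothetical inputs: prevalence 10%, sensitivity 90%, specificity 80%.
p = posterior_given_positive(0.10, 0.90, 0.80)
print(round(p, 3))
```

With these made-up inputs, P(T and D) = .09 and P(T and DC) = .18, so P(D|T) = .09/.27, or about .333.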

CLASSIFICATION

• Response variable is categorical

• Classify based on “nearby” similar, known points.

• kNN ==> Look at the k nearest neighbors and classify based on majority opinion of those k points.
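
A minimal sketch of the kNN idea follows, assuming Euclidean distance as the measure of "nearby" (the notes do not specify a distance). The function name `knn_classify` and the training points are hypothetical.

```python
from collections import Counter
from math import dist


def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (point, label) pairs; distance is Euclidean (an assumption).
    """
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labeled points, for illustration only.
train = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B"),
]
print(knn_classify(train, (2, 2), k=3))
print(knn_classify(train, (5, 5), k=3))
```

With two classes, an odd k avoids tied votes.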

Formulas included on the front of the exam

REGRESSION AND CORRELATION

Simple Linear Regression Model: 𝑌 = 𝛽0 + 𝛽1𝑥 + 𝜖

Least squares line (“fitted equation”): 𝑌̂ = 𝛽̂0 + 𝛽̂1𝑥

𝛽̂0 = 𝑦̅ − 𝛽̂1𝑥̅ = Avg. y − (slope)(Avg. x)

𝛽̂1 = 𝑟 · (𝑠𝑦/𝑠𝑥) = (correlation)(SD of y / SD of x)

Multiple Linear Regression Model: 𝑌 = 𝛽0 + 𝛽1𝑥1 + 𝛽2𝑥2 + ⋯ + 𝛽𝑘𝑥𝑘 + 𝜖

Residual = actual value − fitted value = 𝑦 − 𝑦̂

t-tests are for testing single-variable utility

F-test is for testing model utility

BAYES RULE – Set up

D = Has condition (or disease, etc.)

DC = Does NOT have the condition

T = Tests positive

TC = Tests negative

P(D) = prevalence

P(DC) = 1 – P(D)

BAYES RULE – Conditional Probabilities

P(T|D) = true positive rate = sensitivity

P(TC|D) = 1 − P(T|D) = false negative rate

P(T|DC) = false positive rate

P(TC|DC) = 1 − P(T|DC) = true negative rate = specificity

BAYES RULE – Intersections

P(T and D) = P(T|D)P(D)

P(T and DC) = P(T|DC)P(DC)

P(T) = P(T and D) + P(T and DC)

BAYES RULE – Application

P(D|T) = P(T and D) / P(T) = P(T and D) / [P(T and D) + P(T and DC)]

Problem 1. An economics course grades based on x = homework score (data shows sample average score of 80 and standard deviation 10) and y = final exam score (data shows sample average score of 67 and standard deviation 8). The sample correlation between homework score and final exam score is .55.

(a) What is the equation of the least squares line?

𝛽̂1 = 𝑟 · (𝑠𝑦/𝑠𝑥) = .55 × (8/10) = .44

𝛽̂0 = Avg. y − (slope)(Avg. x) = 67 − .44(80) = 31.8

𝑦̂ = 31.8 + .44𝑥

(b) Regardless of what you actually got, suppose the least squares line from part (a) were 𝑦̂ = 30 + .4𝑥. Use this line to predict the final exam score for a student whose homework score was 80.

𝑦̂ = 30 + .4(80) = 62
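
The arithmetic in Problem 1 can be double-checked with a few lines of Python. This is only a sketch; the variable names and the helper `predict_part_b` are chosen here for illustration.

```python
# Summary statistics given in Problem 1.
x_bar, s_x = 80, 10   # homework score: sample mean and SD
y_bar, s_y = 67, 8    # final exam score: sample mean and SD
r = 0.55              # sample correlation

# Part (a): slope = r * (s_y / s_x), intercept = y_bar - slope * x_bar.
slope = r * s_y / s_x
intercept = y_bar - slope * x_bar
print(f"(a) y-hat = {intercept:.1f} + {slope:.2f}x")


# Part (b): prediction from the line stated in part (b), y-hat = 30 + 0.4x.
def predict_part_b(x):
    return 30 + 0.4 * x


print(f"(b) predicted final exam score: {predict_part_b(80):.0f}")
```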

Problem 2. For a particular data set with two variables (x and y), the least squares line was found to be 𝑦̂ = 100 + 20𝑥. Consider three points (x, y):

Point #1: (10, 290); Point #2: (15, 417); Point #3: (12, 300)

Compute the residual for each of the three points.

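  1. Fitted value = 100 + 20(10) = 300; residual = 290 − 300 = −10

  2. Fitted value = 100 + 20(15) = 400; residual = 417 − 400 = 17

  3. Fitted value = 100 + 20(12) = 340; residual = 300 − 340 = −40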
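
The same residuals can be computed in a short loop, applying residual = actual − fitted to each point. The function name `fitted` is chosen here for illustration.

```python
def fitted(x):
    """Fitted value from the least squares line given in Problem 2."""
    return 100 + 20 * x


points = [(10, 290), (15, 417), (12, 300)]

# Residual = actual value - fitted value.
residuals = [y - fitted(x) for x, y in points]
print(residuals)  # [-10, 17, -40]
```

A negative residual means the point lies below the line; a positive residual means it lies above.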

Problem 3. Provide a real-world example (that has not been used in this course) in which two variables are correlated but do not possess a causal relationship. Explain.

Ice cream sales and drowning deaths are correlated because both rise during summer due to hot weather. However, eating ice cream does not cause drowning. The confounding variable is temperature.

Problem 4. Social media use and mental health outcomes are often found to be correlated. Consider x = time spent on social media and y = self-reported anxiety level (numeric; higher = greater anxiety). What kind of correlation might you expect? Why?

Expected correlation: Positive

Reason: More time on social media can lead to increased anxiety from social comparison, negative content exposure, and sleep disruption.

Problem 5. A data scientist is analyzing the relationship between the number of hours students spend studying per week (HoursStudied) and their final exam scores (ExamScore) in an introductory economics course. Use the R output below to answer the following question.

===============================================================

R Output:

lm(formula = ExamScore ~ HoursStudied)

Coefficients:

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     62.405       2.591    24.09   < 2e-16 ***
HoursStudied     1.875       0.432     4.34   0.00006 ***

===============================================================

For using HoursStudied to predict ExamScore, does the simple linear regression model appear to be useful? Be sure to state the hypotheses, state your conclusion, and justify your answers.

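H0: 𝛽1 = 0 (HoursStudied is NOT useful for predicting ExamScore)

HA: 𝛽1 ≠ 0 (HoursStudied is useful)

p-value = .00006 < 0.05

Reject H0. There is strong evidence that HoursStudied is useful for predicting ExamScore, so the simple linear regression model appears to be useful.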
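
As a quick check of the R output, the t value for HoursStudied is the estimate divided by its standard error. The sketch below reproduces that number and the decision at the 0.05 level used above. The p-value is taken from the output as given; it is not recomputed because the sample size is not shown.

```python
# Values read from the HoursStudied row of the R output in Problem 5.
estimate = 1.875
std_error = 0.432
p_value = 0.00006
alpha = 0.05  # significance level used in the answer above

# t value = estimate / standard error.
t_value = estimate / std_error
print(round(t_value, 2))  # 4.34

decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(decision)  # reject H0
```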

Problem 6. A filter identifies 95% of spam emails correctly and mislabels 3% of non-spam emails as spam. If spam makes up 20% of all emails, what is the probability that an email marked as spam truly is spam?

P(Spam|Positive) = [P(Positive|Spam) * P(Spam)] / P(Positive)

P(Positive) = (0.95)(0.20) + (0.03)(0.80) = 0.19 + 0.024 = 0.214

==> P(Spam|Positive) = (0.95)(0.20) / 0.214 ≈ 0.888 (88.8%)
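
The same answer can be reached by counting emails instead of multiplying probabilities. The sketch below imagines a batch of 10,000 emails (an arbitrary size chosen for illustration) and applies the rates from Problem 6.

```python
total = 10_000                            # hypothetical batch size, for illustration

spam = total * 20 // 100                  # 20% of emails are spam
non_spam = total - spam

spam_flagged = spam * 95 // 100           # 95% of spam is marked as spam
non_spam_flagged = non_spam * 3 // 100    # 3% of non-spam is marked as spam

flagged = spam_flagged + non_spam_flagged
share_truly_spam = spam_flagged / flagged
print(spam_flagged, non_spam_flagged, flagged)
print(round(share_truly_spam, 3))
```

Of the 2,140 emails marked as spam in this batch, 1,900 really are spam, which matches the probability of about .888 found above.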

Problem 7. Consider a test used to diagnose a disease. The test has a sensitivity of 95% and a specificity of 90%. If the disease prevalence is 2%, find the probability that a person testing positive actually has the disease.

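P(D|T) = P(T|D)P(D) / P(T)

P(T|D) = sensitivity = .95

P(TC|DC) = specificity = .90, so P(T|DC) = 1 − .90 = .10

P(D) = .02, so P(DC) = .98

P(T) = (.95)(.02) + (.10)(.98) = .019 + .098 = .117

P(D|T) = .019 / .117 ≈ .162 (16.2%)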
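
To avoid rounding questions, the Problem 7 calculation can be carried out with exact fractions. This is a sketch using Python's standard `fractions` module; the variable names are chosen here for illustration.

```python
from fractions import Fraction

prevalence = Fraction(2, 100)     # P(disease)
sensitivity = Fraction(95, 100)   # P(positive | disease)
specificity = Fraction(90, 100)   # P(negative | no disease)

p_pos_and_disease = sensitivity * prevalence
p_pos_and_healthy = (1 - specificity) * (1 - prevalence)
p_pos = p_pos_and_disease + p_pos_and_healthy

p_disease_given_pos = p_pos_and_disease / p_pos
print(p_pos, p_disease_given_pos, round(float(p_disease_given_pos), 5))
```

The exact answer is 19/117, which is about .16239: even after a positive test, the chance of disease is roughly 16%.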

Problem 8. In your own words, explain Bayes’ rule. Why is it useful in assessing the importance of a “positive test” for a condition?

Bayes’ rule updates the probability of an event based on new evidence and provides a framework for reversing the order of conditional probabilities (i.e., you know P(T|D) but want to know P(D|T)). It helps us better interpret positive tests, especially when diseases or conditions are rare. For a rare condition, the majority of positive tests may be false positives, which means that a person who tests positive may still have a low probability of actually having the condition.

Problem 9. Provide a real-world example (that has not been used in this course) to which Bayes’ rule might be applied. In the context of that example, define what is meant by prevalence, sensitivity, and specificity.

Many possibilities. This is just one example.

Airport security detecting excess liquids (beyond allowed limit).

- Prevalence: Probability of a randomly selected passenger carrying excess liquids.

- Sensitivity: Probability that a person carrying excess liquids is detected.

- Specificity: Probability that a non-carrying person is correctly cleared.

Problem 10. A medical diagnostic test for a particular disease is “95% accurate.” This means, if someone who has the disease is tested, the probability of a positive test is .95 and, if a person who does not have the disease is tested, the probability of a negative test is .95. If a person tests positive, should they be concerned? What additional information would they need to know in order to properly judge their risk?

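The person should not necessarily be concerned. To assess the ramifications of a positive test (T), we need P(D|T), the probability of having the disease given a positive test. Computing this with Bayes’ rule requires P(D), the prevalence of the disease. If the disease is rare, most positive tests are false positives, so P(D|T) can be low even though the test is “95% accurate.”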
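
To illustrate why prevalence matters, the sketch below applies Bayes' rule to the “95% accurate” test from Problem 10 at several prevalence values. The prevalences are hypothetical, since the problem does not give one, and the function name is chosen here for illustration.

```python
def p_disease_given_positive(prevalence, sensitivity=0.95, specificity=0.95):
    """P(D|T) for the test in Problem 10 (sensitivity = specificity = .95)."""
    p_t_and_d = sensitivity * prevalence
    p_t_and_dc = (1 - specificity) * (1 - prevalence)
    return p_t_and_d / (p_t_and_d + p_t_and_dc)


# Hypothetical prevalence values; Problem 10 does not state the true one.
for prevalence in (0.001, 0.01, 0.10, 0.50):
    posterior = p_disease_given_positive(prevalence)
    print(f"P(D) = {prevalence:.3f}  ->  P(D|T) = {posterior:.3f}")
```

At a prevalence of 1%, a positive test implies only about a 16% chance of disease; at a prevalence of 50%, the same positive test implies a 95% chance.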
