Exam 3 Practice
ABOUT EXAM #3
• In class on Thursday, Apr. 30
• Paper test
• No computer, no notes, no phones
• You may use a calculator that cannot access the internet (only to assist with computations), but you do not need one.
All you need to bring is a writing utensil.
• Questions will be similar to HW3
REGRESSION AND CORRELATION
Simple Linear Regression Model: 𝑌 = 𝛽0 + 𝛽1𝑥 + 𝜖, where
• 𝛽0 = y-intercept
• 𝛽1 = slope
• 𝜖 = random error (often assumed to be normal with mean = 0 and unknown standard deviation = 𝜎).
Least squares line (“fitted equation”): 𝑌Hat = 𝛽0Hat + 𝛽1Hat𝑥, where
• 𝛽0Hat = yBar − 𝛽1Hat·xBar = Avg. y – (slope)(Avg. x)
• 𝛽1Hat = 𝑟 * 𝑠y/𝑠x = (correlation)(SD of y/SD of x)
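A minimal Python sketch of the two formulas above. The x and y values are made up for illustration (they are not course data). The slope is computed as r * sy/sx and compared with the direct least squares formula Sxy/Sxx, which should agree.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical data, chosen only for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

x_bar, y_bar = mean(x), mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

r = sxy / sqrt(sxx * syy)            # sample correlation
slope = r * stdev(y) / stdev(x)      # slope = r * sy / sx (sample SDs)
intercept = y_bar - slope * x_bar    # intercept = yBar - slope * xBar
direct_slope = sxy / sxx             # same slope from the direct formula

print(round(slope, 3), round(direct_slope, 3), round(intercept, 3))
```

Both slope computations give 0.6 here, with intercept 2.2.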
Multiple Linear Regression:
• Model: 𝑌 = 𝛽0 + 𝛽1𝑥1 + 𝛽2𝑥2 + ⋯ + 𝛽𝑘𝑥𝑘 + 𝜖
• Fitted equation: 𝑌Hat = 𝛽0Hat + 𝛽1Hat𝑥1 + 𝛽2Hat𝑥2 + ⋯ + 𝛽𝑘Hat𝑥𝑘
Residual = actual value − fitted = y - yhat
Regression Hypothesis Test
Is 𝑥𝑖 useful for predicting y (in a model that contains the other x variables)?
• Test H0: 𝛽𝑖 = 0 (𝑥𝑖 is NOT useful) vs. HA: 𝛽𝑖 ≠ 0 (𝑥𝑖 is useful)
• t-tests shown by R on the regression summary
Is the model useful for predicting y?
• Test H0: 𝛽1 = 𝛽2 = ... = 𝛽𝑘 = 0 (model is NOT useful) vs. HA: at least one 𝛽𝑖 ≠ 0 (model is useful)
• F-test shown by R on the regression summary
In the case of simple linear regression, the t-test for utility of x and the F-test for model utility are equivalent.
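One way to see the equivalence: with a single predictor, the F statistic equals the square of the slope's t statistic. The sketch below checks this numerically on made-up data (the x and y values are hypothetical), using only the standard sums of squares.

```python
from math import sqrt

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = sxy / sxx              # least squares slope
b0 = y_bar - b1 * x_bar     # least squares intercept

fitted = [b0 + b1 * xi for xi in x]
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # residual sum of squares
ssr = sum((fi - y_bar) ** 2 for fi in fitted)            # regression sum of squares
mse = sse / (n - 2)                                      # n - 2 error df in simple regression

t_stat = b1 / sqrt(mse / sxx)   # t statistic for H0: beta1 = 0
f_stat = (ssr / 1) / mse        # F statistic with 1 and n - 2 df

print(round(t_stat ** 2, 6), round(f_stat, 6))
```

Both printed values are 4.5, so the two tests give the same p-value and the same conclusion.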
BAYES RULE
D = Has condition (or disease, etc.)
DC = Does NOT have the condition
T = Tests positive (test says has the condition)
TC = Tests negative (test says does not have the condition)
Prior probabilities
• P(D) = prevalence
• P(DC) = 1 – P(D)
Conditional:
• P(T|D) = true positive rate = sensitivity
• P(TC|D) = 1 - P(T|D) = false negative rate
• P(T|DC) = false positive rate
• P(TC|DC) = 1 - P(T|DC) = true negative rate = specificity
Intersections
• P(T and D) = P(T|D)P(D)
• P(T and DC) = P(T|DC)P(DC)
• P(T) = P(T and D) + P(T and DC)
Bayes Rule
• P(D|T) = P(T and D) / P(T) = P(T and D) / [P(T and D) + P(T and DC)]
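The chain of formulas above can be sketched as a small Python function. The function name and the input values are assumptions made for illustration; the steps mirror the intersections and Bayes rule lines exactly.

```python
def posterior_given_positive(prevalence, sensitivity, false_positive_rate):
    """Return P(D|T) from P(D), P(T|D), and P(T|DC)."""
    p_t_and_d = sensitivity * prevalence                  # P(T and D) = P(T|D)P(D)
    p_t_and_dc = false_positive_rate * (1 - prevalence)   # P(T and DC) = P(T|DC)P(DC)
    p_t = p_t_and_d + p_t_and_dc                          # P(T)
    return p_t_and_d / p_t                                # P(D|T)

# Hypothetical inputs: 1% prevalence, 90% sensitivity, 5% false positive rate
p = posterior_given_positive(prevalence=0.01, sensitivity=0.90, false_positive_rate=0.05)
print(round(p, 4))
```

If a problem gives specificity instead of the false positive rate, pass 1 - specificity as the third argument.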
CLASSIFICATION
• Response variable is categorical
• Classify based on “nearby” similar, known points.
• kNN ==> Look at the k nearest neighbors and classify based on majority opinion of those k points.
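A minimal sketch of kNN classification in Python, assuming Euclidean distance and a simple majority vote. The function name, points, and labels are hypothetical; real use would also need a rule for ties (for two classes, an odd k avoids them).

```python
from collections import Counter
from math import dist

def knn_classify(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Order training indices by Euclidean distance to the query point
    order = sorted(range(len(train_points)),
                   key=lambda i: dist(train_points[i], query))
    nearest_labels = [train_labels[i] for i in order[:k]]
    # most_common(1) returns the label with the most votes
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical labeled points forming two clusters
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_classify(points, labels, query=(2, 2), k=3))
print(knn_classify(points, labels, query=(6, 5), k=3))
```

The first query sits next to the "A" cluster and the second next to the "B" cluster, so they are classified as A and B respectively.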
Formulas included on the front of the exam
REGRESSION AND CORRELATION
Simple Linear Regression Model: 𝑌 = 𝛽0 + 𝛽1𝑥 + 𝜖
Least squares line (“fitted equation”): 𝑌Hat = 𝛽0Hat + 𝛽1Hat𝑥
𝛽0Hat = yBar − 𝛽1Hat·xBar = Avg. y – (slope)(Avg. x)
𝛽1Hat = 𝑟 * 𝑠y/sx = (correlation)(SD of y/SD of x)
Multiple Linear Regression Model: 𝑌 = 𝛽0 + 𝛽1𝑥1 + 𝛽2𝑥2 + ⋯ + 𝛽𝑘𝑥𝑘 + 𝜖
Residual = actual value − fitted = y - yHat
t-tests are for testing the utility of a single variable
F-test is for testing model utility
BAYES RULE – Set up
D = Has condition (or disease, etc.)
DC = Does NOT have the condition
T = Tests positive
TC = Tests negative
P(D) = prevalence
P(DC) = 1 – P(D)
BAYES RULE – Conditional Probabilities
P(T|D) = true positive rate = sensitivity
P(TC|D) = 1 - P(T|D) = false negative rate
P(T|DC) = false positive rate
P(TC|DC) = 1 - P(T|DC) = true negative rate = specificity
BAYES RULE - Intersections
P(T and D) = P(T|D)P(D)
P(T and DC) = P(T|DC)P(DC)
P(T) = P(T and D) + P(T and DC)
BAYES RULE - Application
P(D|T) = P(T and D) / P(T) = P(T and D) / [P(T and D) + P(T and DC)]
Problem 1. An economics course grades based on x = homework score (data
shows sample average score of 80 and standard deviation 10) and y = final exam
score (data shows sample average score of 67 and standard deviation 8). The
sample correlation between homework score and final exam score is .55.
(a) What is the equation of the least squares line?
𝛽1Hat = 𝑟 * 𝑠y/𝑠x = .55 * (8/10) = .44
𝛽0Hat = Avg. y – (slope)(Avg. x) = 67 − .44(80) = 31.8
yHat = 31.8 + .44x
(b) Regardless of what you actually got, suppose the least squares line from part (a)
were yHat = 30 + .4x. Use this line to predict the final exam score for a student whose
homework score was 80.
yHat = 30 + .4(80) = 62
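A quick Python check of both parts, using only the summary statistics given in the problem. The variable names are illustrative.

```python
# Summary statistics from Problem 1
x_bar, s_x = 80, 10    # homework score: sample mean and SD
y_bar, s_y = 67, 8     # final exam score: sample mean and SD
r = 0.55               # sample correlation

slope = r * s_y / s_x                # (correlation)(SD of y / SD of x)
intercept = y_bar - slope * x_bar    # Avg. y - (slope)(Avg. x)
print(round(slope, 2), round(intercept, 1))

# Part (b) uses the line given in the problem, yHat = 30 + .4x
prediction = 30 + 0.4 * 80
print(round(prediction, 1))
```

This reproduces slope .44, intercept 31.8, and the part (b) prediction of 62.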
Problem 2. For a particular data set with two variables (x and y), the least squares
line was found to be yHat = 100 + 20x. Consider three points (x,y):
Point #1: (10,290); Point #2: (15, 417); Point #3: (12, 300)
Compute the residual for each of the three points.
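Point #1: fitted = 100 + 20(10) = 300; residual = 290 - 300 = -10
Point #2: fitted = 100 + 20(15) = 400; residual = 417 - 400 = 17
Point #3: fitted = 100 + 20(12) = 340; residual = 300 - 340 = -40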
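The same residuals can be checked with a few lines of Python (residual = actual value minus fitted value). The helper name is illustrative.

```python
def fitted(x):
    """Least squares line from Problem 2: 100 + 20x."""
    return 100 + 20 * x

points = [(10, 290), (15, 417), (12, 300)]

# residual = actual y minus fitted value at that x
residuals = [y - fitted(x) for x, y in points]
print(residuals)   # [-10, 17, -40]
```

A negative residual means the point lies below the line; a positive residual means it lies above.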
Problem 3. Provide a real-world example (that has not been used in this course) in
which two variables are correlated but do not possess a causal relationship.
Explain.
Ice cream sales and drowning deaths are correlated because both rise during
summer due to hot weather. However, eating ice cream does not cause drowning.
The confounding variable is temperature.
Problem 4. Social media use and mental health outcomes are often found to be
correlated. Consider x = time spent on social media and y = self-reported anxiety
level (numeric; higher = greater anxiety). What kind of correlation might you expect?
Why?
Expected correlation: Positive
Reason: More time on social media can lead to increased anxiety from social
comparison, negative content exposure, and sleep disruption.
Problem 5. A data scientist is analyzing the relationship between the number of
hours students spend studying per week (HoursStudied) and their final exam scores
(ExamScore) in an introductory economics course. Use the R output below to
answer the following question.
===============================================================
R Output:
lm(formula = ExamScore ~ HoursStudied)
Coefficients:
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     62.405       2.591    24.09   < 2e-16 ***
HoursStudied     1.875       0.432     4.34   0.00006 ***
===============================================================
For using HoursStudied to predict ExamScore, does the simple linear regression
model appear to be useful? Be sure to state the hypotheses, state your conclusion,
and justify your answers.
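H0: 𝛽1 = 0 (HoursStudied is NOT useful for predicting ExamScore)
HA: 𝛽1 ≠ 0 (HoursStudied is useful)
p-value = .00006 < 0.05 (t-test for HoursStudied; in simple linear regression this is equivalent to the F-test for model utility)
Reject H0. There is evidence that HoursStudied is useful for predicting ExamScore, so the model appears to be useful.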
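As a sanity check on the R output, the t value is the estimate divided by its standard error. The sketch below recomputes it and applies the decision rule, assuming a significance level of 0.05 as in the solution. The p-value is copied from the output rather than recomputed, since the degrees of freedom are not shown.

```python
estimate = 1.875      # slope estimate for HoursStudied (from the R output)
std_error = 0.432     # its standard error (from the R output)
p_value = 0.00006     # Pr(>|t|) as reported by R
alpha = 0.05          # significance level used in the solution

t_value = estimate / std_error    # t = estimate / standard error
reject_h0 = p_value < alpha       # reject H0 when the p-value is below alpha

print(round(t_value, 2), reject_h0)
```

The recomputed t value rounds to 4.34, matching the output, and the decision is to reject H0.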
Problem 6. A filter identifies 95% of spam emails correctly and mislabels 3% of
non-spam emails as spam. If spam makes up 20% of all emails, what is the
probability that an email marked as spam truly is spam?
P(Spam|Positive) = [P(Positive|Spam) * P(Spam)] / P(Positive)
P(Positive) = (0.95)(0.20) + (0.03)(0.80) = 0.19 + 0.024 = 0.214
==> P(Spam|Positive) = (0.95)(0.20) / 0.214 ≈ 0.888 (88.8%)
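Another way to see the same answer is with counts instead of probabilities. The sketch below assumes a hypothetical batch of 10,000 emails (the batch size is an assumption; any size gives the same ratio) and applies the rates from the problem.

```python
total = 10_000                            # hypothetical number of emails
spam = total * 20 // 100                  # 20% of all emails are spam
not_spam = total - spam

spam_flagged = spam * 95 // 100           # 95% of spam is marked as spam
not_spam_flagged = not_spam * 3 // 100    # 3% of non-spam is marked as spam
flagged = spam_flagged + not_spam_flagged # all emails marked as spam

# Share of flagged emails that truly are spam
p_spam_given_flagged = spam_flagged / flagged
print(spam_flagged, not_spam_flagged, round(p_spam_given_flagged, 3))
```

Of the 2,140 flagged emails, 1,900 are truly spam, which gives about 0.888.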
Problem 7. Consider a test used to diagnose a disease. The test has a sensitivity of
95% and a specificity of 90%. If the disease prevalence is 2%, find the probability
that a person testing positive actually has the disease.
P(D|T) = P(T|D)P(D) / P(T)
P(T|D) = .95 (sensitivity)
P(TC|DC) = .90 (specificity), so P(T|DC) = 1 - .90 = .10
P(D) = .02, so P(DC) = .98
P(D|T) = (.95)(.02) / [(.95)(.02) + (.10)(.98)] = .019 / .117 ≈ 0.1624
Problem 8. In your own words, explain Bayes’ rule. Why is it useful in assessing the
importance of a “positive test” for a condition?
Bayes’ rule updates the probability of an event based on new evidence and provides
a framework for reversing the order of conditional probabilities (i.e., you know P(T|D)
but want to know P(D|T)). It helps us better interpret positive tests, especially when
diseases or conditions are rare. For a rare condition, the majority of positive tests
may be false-positives which means that testing positive may still reflect a low
probability of actually having the condition.
Problem 9. Provide a real-world example (that has not been used in this course) to
which Bayes’ rule might be applied. In the context of that example, define what is
meant by prevalence, sensitivity, and specificity.
Many possibilities. This is just one example.
Airport security detecting excess liquids (beyond allowed limit).
- Prevalence: Probability of a randomly selected passenger carrying excess liquids.
- Sensitivity: Probability that a person carrying excess liquids is detected.
- Specificity: Probability that a non-carrying person is correctly cleared.
Problem 10. A medical diagnostic test for a particular disease is “95% accurate.”
This means, if someone who has the disease is tested, the probability of a positive
test is .95 and, if a person who does not have the disease is tested, the probability
of a negative test is .95. If a person tests positive, should they be concerned? What
additional information would they need to know in order to properly judge their risk?
The person should not necessarily be concerned. We need to know P(D|T) to assess
the ramifications of a positive test (T). To compute this using Bayes’ rule, we need to
know P(D), the prevalence of the disease.
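To illustrate why prevalence matters, the sketch below computes P(D|T) for the “95% accurate” test at several prevalence values. The prevalence values are hypothetical, chosen only to show the range of outcomes.

```python
accuracy = 0.95   # P(T|D) = P(TC|DC) = .95, from the problem statement

def p_disease_given_positive(prevalence):
    """P(D|T) by Bayes' rule for the test described in Problem 10."""
    true_pos = accuracy * prevalence                # P(T and D)
    false_pos = (1 - accuracy) * (1 - prevalence)   # P(T and DC)
    return true_pos / (true_pos + false_pos)

# Hypothetical prevalence values, from rare to common
for prevalence in (0.001, 0.01, 0.10, 0.50):
    print(prevalence, round(p_disease_given_positive(prevalence), 3))
```

With a prevalence of 0.001 the probability of disease after a positive test is under 2%, while at a prevalence of 0.50 it is 95%.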