How many intervals (or 'bins' or 'classes') should be chosen when creating a histogram?
\ Most often, about 8-10.
Eleven.
It can vary - it really depends on the distribution of the variable.
A minimum of 5.
It can vary - it really depends on the distribution of the variable.
2
New cards
If you wished to graph/chart the 'Friends' variable, which of the following would be the best choice?
\ Histogram
Pie Chart
Bar chart
None of the above would be reasonable choices.
Histogram
3
New cards
How would you describe the skew (if any) of this histogram?
\ Left skew
Right skew
Symmetric
None of the above
Left skew
\ Skew is described as the direction of the tail.
4
New cards
What measure of center would you use to describe this distribution?
\ Mean because it is symmetric
Mean because it is skewed
Median because it is symmetric
Median because it is skewed
Median because it is skewed
\ If data is skewed, you should typically avoid using the mean as your measure of center.
5
New cards
If you wanted to calculate the center of the distribution for the 'hours online' variable, which measure should you use? (You may assume that this dataset is not skewed).
\ Mean
Median
Standard Deviation
All of the above would be reasonable choices.
Mean
\ Mean is probably your best choice here. If the data were clearly skewed, then you should choose median. Standard deviation is a measure of the \*spread\*, not of the center.
6
New cards
According to this histogram, what is the most common mileage achieved by cars driving in the city?
\ About 15
About 20
Between 15 and 20
Over 25
Between 15 and 20
7
New cards
According to this histogram, approximately how many cars have a city gas mileage \*below\* 40?
\ About 1
About 2
About 14
About 33
About 33
8
New cards
How many cars have a gas mileage of below 15?
\ 1
7
9
24
9
9
New cards
True/False: A histogram should account for 100% of the observations in the dataset. For example, this gas mileage histogram should account for ALL of the 34 cars in the dataset.
True
10
New cards
Of the following options, which best describes the shape of this distribution? You may ignore the outlier at 55-60 mpg.
\ Bimodal
Uniform
Right-skewed
Bell-shaped
Bell-shaped
11
New cards
Of the following options, which best describes the shape of this distribution? Hint: Do you think the density curve drawn on this figure is accurate?
\ Bimodal
Uniform
Right-skewed
Bell-shaped
Bimodal
\ The density curve drawn here should extend to complete the rest of the plot. In this case, we would have a second peak - almost as though there are two different Normal distributions in this dataset. So in this case, we would have a bimodal distribution.
12
New cards
Consider the following ages from the classroom dataset referred to earlier. They have been ordered for your convenience.
19, 21, 21, 22, 23, 24, 27, 84
What is the median age?
\ 22
22\.5
23
30\.125
22\.5
\ When you have an even number of datapoints, then the median is the halfway point (average) between the two middle values. In this case, the midway point between 22 and 23.
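As a quick check of the even-count rule above, here is a minimal sketch in Python (an assumption on my part; the course itself uses R). It averages the two middle values by hand and compares the result with the standard library's `statistics.median`. The variable names are purely illustrative.

```python
from statistics import median

# Ordered ages from the card (8 values, so there is no single middle value).
ages = [19, 21, 21, 22, 23, 24, 27, 84]

# With 8 values, the two middle values are the 4th and 5th (indexes 3 and 4).
middle_pair = (ages[3], ages[4])
by_hand = sum(middle_pair) / 2

# statistics.median applies the same averaging rule to even-length data.
library_median = median(ages)

print(middle_pair, by_hand, library_median)
```

Both approaches should agree, which is a handy sanity check when you are first learning the rule.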
13
New cards
Consider this dataset consisting of ages of several students in a small class. What would be the best statistic for describing the center of this distribution?
\ Mean
Median
Mode
All of the above are reasonable choices.
Median
\ The 84-year-old will probably skew the mean somewhat - particularly as there are only 8 measurements. As a result, the median is probably your best bet for a measure of center.
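To see how much a single extreme value can pull the mean, here is a small Python sketch (Python is my choice here, not the course's tool). It compares the mean and median of the eight ages with and without the 84. Dropping the value is done only to illustrate the effect; it is not a suggestion that the observation should be removed.

```python
from statistics import mean, median

ages = [19, 21, 21, 22, 23, 24, 27, 84]

# For illustration only: the same data without the extreme value.
without_84 = [a for a in ages if a != 84]

# How far does each measure of center move when the 84 is included?
mean_shift = mean(ages) - mean(without_84)
median_shift = median(ages) - median(without_84)

print("mean with / without 84:  ", mean(ages), round(mean(without_84), 2))
print("median with / without 84:", median(ages), median(without_84))
print("shift in mean:", round(mean_shift, 2), " shift in median:", median_shift)
```

The mean moves by several years while the median barely moves, which is what "resistant" means in practice.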
14
New cards
You take an IQ test and are told that you are in the 52nd percentile. What does this mean?
\ 48% of people who took the test have a higher IQ than you.
52% of people who took the test have a higher IQ than you.
Your IQ is considerably higher than average.
None of the above.
48% of people who took the test have a higher IQ than you.
\ Percentile refers to the percentage of datapoints BELOW the reported value.
15
New cards
The third quartile corresponds to which of the following?
\ The 25th percentile
The 50th percentile
The 75th percentile
None of the above
The 75th percentile
16
New cards
In the boxplot shown below, what is the mean?
\ 30
35
40
Can not be accurately determined from this plot.
Can not be accurately determined from this plot.
\ The middle line of a boxplot gives the median (Q2) - not the mean. Do not confuse those terms. They may both start with the same letter, but they are, of course, different measures of center!
17
New cards
True/False: The following appears to be a reasonably good 5-number summary for this boxplot: 10, 20, 30, 60, 65
True
18
New cards
If you choose to remove an outlier from your dataset, what must also always be done?
\ Replace it with a different, legitimate value.
Contact the source of your original dataset to let them know they have faulty data.
Indicate somewhere in your discussion that you have done so.
None of the above.
Indicate somewhere in your discussion that you have done so.
\ It is vital to always let those reading your report know if you have removed any observations from your analysis.
19
New cards
Which of the following is \*most\* likely to be considered a valid reason for removing an observation due to it being an outlier?
\ The observation is clearly a data entry error.
The observation is outside the typical range of the other observations.
Neither of the above are valid reasons.
The observation is clearly a data entry error.
\ Data entry errors are not terribly common, but if you spot one, then it is certainly a good reason to remove it. Just because an observation is well outside the range of the other observations is not, in and of itself, a valid reason to remove it.
20
New cards
True/False: If you had to guess, the car make that achieves 55-60 mpg on the road should probably be considered an outlier.
True
21
New cards
True/False: For a dataset with outliers, the best statistic to describe the center of a distribution is the mean.
False
\ If you have outliers, this may well skew the mean outward in the direction of the outlier (either high or low depending on where the outlier is). When you are trying to decide between mean and median and there are outliers present, you should typically use the median.
22
New cards
Which of the following measures of center is said to be 'resistant' to outliers?
\ Mean
Median
Both are resistant to outliers.
Neither are resistant to outliers.
Median
\ The mean is definitely affected by outliers. The median is not - unless there are many outliers present. However, if there are many, many outliers then... they should probably not be considered outliers! Remember that an outlier is a value that is considered to be atypical or abnormal for that particular dataset.
23
New cards
What is the Interquartile Range (IQR)?
\ The distance between the first and third quartiles.
The distance between the 25th and 75th percentiles.
Both of the above answers are correct.
None of the above.
Both of the above answers are correct.
\ Remember that Q1 is synonymous with the 25th percentile and Q3 is synonymous with the 75th percentile.
24
New cards
What is the IQR for this boxplot?
\ 40
30
20
None of the above
40
\ Q3 is 60 and Q1 is 20. Therefore the IQR = 60-20 = 40.
25
New cards
Here is the 5-number summary for a certain dataset: 10, 20, 25, 30, 42.
True/False: According to the 1.5 IQR rule, a value of 48 should be considered an outlier.
True
\ IQR is the difference between Q3 and Q1, in this case, the difference between 30 and 20, which is 10. 1.5\*10 = 15. We then compare the suspicious value (48) with Q3 (30). Since 48 is MORE than 15 above Q3 (that is, above 30+15 = 45), we WOULD consider 48 to be an outlier.
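The 1.5 IQR rule from this card can be sketched in a few lines of Python (my choice of language; the helper name `is_outlier` and the "fence" variable names are illustrative, not from the lecture):

```python
# Five-number summary from the card: min, Q1, median, Q3, max.
minimum, q1, med, q3, maximum = 10, 20, 25, 30, 42

iqr = q3 - q1                   # 30 - 20 = 10
upper_fence = q3 + 1.5 * iqr    # anything above this is flagged
lower_fence = q1 - 1.5 * iqr    # anything below this is flagged


def is_outlier(x):
    """Return True if x falls outside the 1.5 * IQR fences."""
    return x < lower_fence or x > upper_fence


print(lower_fence, upper_fence)
print(is_outlier(48), is_outlier(maximum), is_outlier(minimum))
```

Note that the rule only flags a value as suspicious. Whether to remove it is a separate decision, as discussed in the earlier cards.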
26
New cards
What is the proper or 'official' name typically applied to this distribution?
\ Bell-shaped distribution
Symmetric distribution
Density curve distribution
Normal distribution
Normal distribution
\ While this distribution is bell-shaped and is also symmetric, the proper name for it is the Normal distribution.
27
New cards
On this graph, name (1) the distribution and (2) the density curve.
\ Uniform distribution / The straight red line
Uniform distribution / The relatively even bars
Histogram distribution / Symmetric density curve
Not a valid distribution / There is no visible density curve on this chart.
Uniform distribution / The straight red line
28
New cards
Identify the distribution pictured here:
\ Left skewed
Right skewed
Uniform
Distribution can not be identified without a histogram.
Left skewed
29
New cards
If you had to make an educated guess, what would you imagine to be the most common distribution encountered in 'real life'?
\ Right skewed distribution
Left skewed distribution
Normal distribution
Bimodal distribution
Normal distribution
30
New cards
According to this graph, which of the following represents the approximate mean gestation time?
\ Between 230 and 270 days
Less than 250 days
More than 250 days
250 days
250 days
31
New cards
What percentage of people had a gestation time of less than 250 days?
\ Less than 50
More than 50
About 50%
Unable to determine
About 50%
32
New cards
Make your best estimate: Approximately what percentage of people would be expected to score MORE than 8 on this grade equivalency test?
\ 40%
50%
60%
90%
40%
33
New cards
Make your best estimate: Approximately what percentage of people would be expected to score MORE than 4 on this grade equivalency test?
\ 10%
40%
50%
80%
80%
34
New cards
What is the difference between variance and standard deviation (SD)?
\ They are essentially the same thing. SD is simply the square root of the variance.
SD is more accurate than variance.
Variance is used for quartiles whereas SD is used for accurate estimation under the curve.
They are essentially the same thing. SD is simply the square root of the variance.
35
New cards
True/False: Standard deviation should generally be limited to distributions that are symmetric and have no outliers.
True
\ Since we use the mean in our calculation of SD, we should avoid SD in situations involving outliers/asymmetry.
36
New cards
When attempting to describe and/or analyze a distribution, which of the following is the most important number to consider?
\ Variation and Range
Center (using mean or median)
Spread (e.g. using standard deviation)
BOTH center and spread are very important in describing a distribution.
BOTH center and spread are very important in describing a distribution.
37
New cards
The variance of a distribution is calculated to be 9. What is the standard deviation?
\ 0
3
18
They describe different concepts, and therefore, can not be determined from the information given.
3
\ Remember that the SD is simply the square root of the variance.
38
New cards
True/False: Not all values in a dataset with a Normal distribution can be converted to a z-score.
False
\ Every value in a Normal dataset can be converted to a z-score.
39
New cards
You convert a datapoint to a z-score of -2.4. What does this value tell you?
\ Your observation lies 2.4 standard deviations below the mean.
Your observation equals 2.4 times the mean on the negative side.
Your observation is in the 2.4 percentile below the mean.
None of the above
Your observation lies 2.4 standard deviations below the mean.
40
New cards
What does the Greek character 'mu' represent?
Tip: 'mu' is the character that looks like a funny-shaped 'u'.
\ The standard deviation of a population
The standard deviation of a sample
The mean of a population
The mean of a sample
The mean of a population
41
New cards
Given a mean of 100 and a standard deviation of 5, what is the z-score of 115?
\ \+1.5
\-1.5
\+3
\-3
\+3
42
New cards
Given a mean of 20 and a standard deviation of 4, what is the z-score of 25?
\ Somewhere between 0 and 1
Somewhere between 1 and 2.
Somewhere between 0 and -1
Somewhere between -1 and -2
Somewhere between 1 and 2.
43
New cards
Using the formula discussed in lecture: z = (x-mean)/sd
What is the z-score for an observation of 15.5 given a mean of 20 and a standard deviation of 3?
\ \-1
\-1.5
\-2
None of the above
\-1.5
\ z = (15.5-20) / 3 = -1.5
44
New cards
What is the z-score for an observation of 27 given a mean of 20 and a standard deviation of 3?
\ \+2.5
More than 3
\-2.5
None of the above
None of the above
\ z = (27-20)/3 = 2.333
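The last few cards all use the same formula, z = (x-mean)/sd. Here is a minimal Python sketch of it (Python is an assumption on my part, and the function name `z_score` is just illustrative), applied to the four examples above:

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


# (observation, mean, standard deviation) taken from the preceding cards.
examples = [(115, 100, 5), (25, 20, 4), (15.5, 20, 3), (27, 20, 3)]

for x, mean, sd in examples:
    print(f"x={x}, mean={mean}, sd={sd} -> z={z_score(x, mean, sd):.3f}")
```

A positive result means the observation is above the mean, and a negative result means it is below.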
45
New cards
The following distribution has a mean of 10 and a standard deviation of 2. What is the z-score for an observation of 6?
Generous hint: Examine the distribution....
\ \-2
0
\+2
None of the above.
None of the above.
\ We should avoid using standard deviation for a highly skewed distribution.
46
New cards
You look up a certain z-score using either R's pnorm() function, or on a standard normal table (z-table). What does the value that you find tell you?
\ The proportion of observations that lie under the curve to the left of your z-score.
The proportion of observations from the histogram that lie under the curve to the right of your z-score.
None of the above.
The proportion of observations that lie under the curve to the left of your z-score.
\ The value on a z-table (or returned by pnorm()) tells you the proportion of the area under the Normal curve that lies to the LEFT of that z-score.
47
New cards
You look up a z-score using R's pnorm() function and are given a value of 0.342. What does that value tell you?
\ About 34% of observations lie to the left of your z-score.
About 66% of observations lie to the right of your z-score.
Both of the above.
None of the above.
Both of the above.
48
New cards
Given: N(3, 0.220), which (if any) of the following statements is FALSE?
\ The dataset is approximately normally distributed.
The standard deviation is 0.220
The mean is 3
All of the above statements are TRUE.
All of the above statements are TRUE.
\ The 'N' is statistical notation that says the data follows a Normal distribution. The first number in the parentheses always refers to the mean, while the second number always refers to the SD.
49
New cards
A dataset composed of the following values follows a Normal distribution: 59 60 61 62 62 63 63 63 64 64 65 66 67 68. Is it possible to calculate a z-score for the value 63.49?
\ Yes, you can calculate a z-score for any value within the range of a Normal distribution.
Yes, you can calculate a z-score provided you are near the mean of a Normal distribution.
No, because the value is not represented in the dataset, you can not calculate a z-score.
No. Just because.
Yes, you can calculate a z-score for any value within the range of a Normal distribution.
\ Once you know a dataset follows a Normal distribution, if you have access to ALL of the values in that dataset, then you can calculate the mean and SD. Once you have a mean and SD, you can calculate a z-score for ANY value. That is, you are NOT limited to values that were present in the original dataset.
50
New cards
Use R and your z-score formula: What is the proportion (or percentage) of values that lie below a z-score of -0.30?
\ 0\.378
0\.382
0\.617
None of the above
0\.382
\ The value returned by the pnorm() function is the proportion of values that lie below the provided z-score.
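If you do not have R handy, Python's standard library offers a close equivalent. The sketch below assumes that `statistics.NormalDist().cdf(z)` plays the same role as R's `pnorm(z)` for the standard Normal distribution. It is offered as a stand-in for checking your work, not as the course's method.

```python
from statistics import NormalDist

# Standard Normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

# Area under the curve to the LEFT of z = -0.30 (what pnorm(-0.30) reports in R).
area_left = standard_normal.cdf(-0.30)

print(round(area_left, 3))
```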
51
New cards
Using R and your z-score formula: What percentage of observations are \*above\* a z-score of -0.22?
\ 41\.3%
41\.7%
58\.3%
58\.7%
58\.7%
\ Look up a z-score of -0.22 with pnorm() and you will find a value of about 0.413. Note that this value represents the area to the LEFT of z=-0.22. Since we want the area to the RIGHT of z=-0.22, we take (1-0.413), which is 0.587. Recall that the results from pnorm() or a z-table are given as proportions. To convert to a percentage, simply multiply by 100. That gives 58.7%.
52
New cards
Using R, what is the area under the curve (rounded to the nearest percentage) to the right of z=0.31?
\ 35%
38%
62%
None of the above
38%
\ R tells us that the area to the LEFT of 0.31 is 0.62. If you want the area to the RIGHT of z=0.31, then you would need to calculate (1-0.62) which is 0.38 (or 38%).
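Both "area to the right" cards follow the same pattern: find the area to the left, then subtract it from 1. Here is a hedged Python sketch, again assuming `statistics.NormalDist().cdf()` as a stand-in for R's `pnorm()` (the helper name `area_right` is illustrative):

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, sd 1


def area_right(z):
    """Proportion of the standard Normal curve lying to the RIGHT of z."""
    return 1 - standard_normal.cdf(z)


# The two right-tail questions from the preceding cards.
for z in (-0.22, 0.31):
    print(f"z={z}: left={standard_normal.cdf(z):.3f}, right={area_right(z):.3f}")
```

Multiply the proportions by 100 if the question asks for a percentage.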
53
New cards
A group of women are sampled and their heights give the distribution (in inches): N(64.5, 2.5) What percentage of women are \*shorter\* than 68 inches?
\ 7\.9%
8\.1%
92\.1%
91\.9%
91\.9%
\ z = (x-mu)/sigma = (68-64.5)/2.5 = +1.4. Since we are asking for women SHORTER than 68 inches, we are interested in the area to the LEFT of z=+1.4, which is 91.92%.
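This calculation can be double-checked with a short Python sketch (Python's `statistics.NormalDist` is used here as an assumed stand-in for R's `pnorm()`). It computes the z-score first and then the area to its left, and also shows that working directly with N(64.5, 2.5) gives the same proportion.

```python
from statistics import NormalDist

mu, sigma = 64.5, 2.5   # N(64.5, 2.5): heights in inches
x = 68

# Step 1: convert the height to a z-score.
z = (x - mu) / sigma

# Step 2: area to the LEFT of that z-score on the standard Normal curve.
shorter_via_z = NormalDist().cdf(z)

# Same answer without standardizing by hand.
shorter_direct = NormalDist(mu, sigma).cdf(x)

print(round(z, 2), round(shorter_via_z, 4), round(shorter_direct, 4))
```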
54
New cards
A randomly selected individual from the N(64.5, 2.5) distribution tells you she is at the 84th percentile for height. How tall is she?
\ About 62 inches
About 67 inches
About 68 inches
Not enough information to determine.
About 67 inches
\ For this, we need to use R's qnorm() function. qnorm(0.84) gives us 0.99, so the 84th percentile corresponds to a z-score of about 1 (or 0.99). Then it is simply a matter of working backwards and solving for x:
z = (x-mu)/sigma
x = (z\*sigma) + mu
x = (1\*2.5) + 64.5 = 67
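Going from a percentile back to a value is the reverse of the pnorm() lookup. The Python sketch below assumes that `statistics.NormalDist().inv_cdf(p)` behaves like R's `qnorm(p)`. It is meant only as a way to check the arithmetic.

```python
from statistics import NormalDist

mu, sigma = 64.5, 2.5   # N(64.5, 2.5): heights in inches
percentile = 0.84

# Step 1: z-score that has 84% of the area to its left (qnorm(0.84) in R).
z = NormalDist().inv_cdf(percentile)

# Step 2: solve z = (x - mu) / sigma for x.
height = z * sigma + mu

print(round(z, 2), round(height, 1))
```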
55
New cards
A student scores 24 on this year's ACT exam. He describes it as a fantastic score since the "average was only 18". Which of the following would be a reasonable response?
\ I agree, that sounds like a very good score.
It sounds like a fairly good score, as you were above the mean, but it was only by 6 points.
Sounds like a good score as it is above average, but I can't tell just how good it was without some measure of the variation.
None of the above are reasonably accurate responses.
Sounds like a good score as it is above average, but I can't tell just how good it was without some measure of the variation.
\ Suppose the standard deviation was 12 points? In this case, the student would have only finished 0.5 standard deviations above the mean which is respectable, but not 'fantastic'!
56
New cards
In the image seen here, which of the curves has the lowest standard deviation?
\ Blue
Green
Purple / Burgundy
The standard deviations are all the same
Not possible to say from the information given.
Blue
\ The wider the curve, the larger the spread. I.e. the widest curve has the largest standard deviation, and the narrowest curve has the smallest.
57
New cards
In the image seen here, which of the curves has the highest mean?
\ Blue
Green
Purple
The means are all the same
Not possible to say from the information given.
The means are all the same
\ The mean of a Normal distribution is the center of the curve. Note that this only applies to the mean of a Normal distribution. If you are looking at a skewed distribution, the mean is NOT at the center of a curve. Key Point: Different rules apply to different distributions. DO NOT allow yourself to get 'lazy' about this detail, as it will come up repeatedly throughout the course and on exams.
58
New cards
Two students are applying to your graduate school program.
Student A had a GPA of 3. Her school's GPAs had the distribution N(2.5, 0.5).
Student B had a GPA of 3.5. Her school's GPAs had the distribution N(3, 1).
Which student performed better within their school?
\ Student A
Student B
They are approximately equal.
Not enough information given to evaluate their relative scores.
Student A
\ Student A has a z score of +1. Student B has a z-score of 0.5.
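Z-scores let you compare values that come from different distributions. Here is a minimal Python sketch of the comparison above (the language and the dictionary layout are my own choices for illustration):

```python
# Each student's GPA along with her school's mean and standard deviation.
students = {
    "A": {"gpa": 3.0, "mean": 2.5, "sd": 0.5},
    "B": {"gpa": 3.5, "mean": 3.0, "sd": 1.0},
}

# Standardize each GPA relative to its own school's distribution.
z_scores = {
    name: (s["gpa"] - s["mean"]) / s["sd"] for name, s in students.items()
}

# The student with the larger z-score did better relative to her school.
best = max(z_scores, key=z_scores.get)

print(z_scores, "-> better relative performance:", best)
```

Student B has the higher raw GPA, but Student A stands further above her own school's mean.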
59
New cards
The gestation time for a group of women in a study was N(265, 25).
How many days did the upper 50% of women carry their babies?
\ More than 265 days.
Less than 265 days.
Between 240 and 290 days.
Unable to determine.
More than 265 days.
\ 50% of women carried their babies for 265 or fewer days, while the other 50% carried for 265 days or longer.
60
New cards
The gestation time for a group of women in a study was N(265, 25).
How many days did the upper/top 84% of women carry their babies?
Hint: If I asked you about the upper/top 10% of women, this would refer to the 90th percentile. If I asked you about the top 20% of women, this would refer to the 80th percentile and so on...
\ About 240 or more days.
About 240 or fewer days.
About 290 or fewer days.
About 290 or more days.
About 240 or more days.
\ The upper 84% of women translates to the 16th percentile. If you look up the 16th percentile on a z-table, it corresponds to a z score of -1. Now we solve for x:
z = (x - mu) / sigma
x = (z\*sigma) + mu
x = (-1\*25) + 265 = 240
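The "upper 84%" reasoning can be checked with a short Python sketch. It assumes `statistics.NormalDist.inv_cdf()` as a stand-in for R's `qnorm()`. The variable names are illustrative.

```python
from statistics import NormalDist

gestation = NormalDist(mu=265, sigma=25)   # N(265, 25), in days

top_share = 0.84
# The top 84% begins where the bottom 16% ends, i.e. at the 16th percentile.
percentile = 1 - top_share

z = NormalDist().inv_cdf(percentile)       # z-score for the 16th percentile
cutoff_days = z * gestation.stdev + gestation.mean

print(round(z, 2), round(cutoff_days))
```

The exact z-score is slightly less extreme than -1, which is why the cutoff comes out a shade above 240 before rounding.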
61
New cards
The gestation time for a group of women in a study was N(265, 25).
About how many days did the lower (bottom) 16% of women carry their babies?
\ About 240 or more days.
About 240 or fewer days.
About 290 or fewer days.
About 290 or more days.
About 240 or fewer days.
\ Yes, this is the exact same question as the previous one, just stated somewhat differently.
62
New cards
Use the empirical rule to answer the following:
The score distribution for a certain standardized exam was N(17, 0.3). What percentage of students scored between 16.7 and 17.3?
\ About 16%
About 32%
About 68%
About 84%
About 68%
\ This corresponds to the range between z=-1 and z=+1. Recall that about 68% of observations lie within this range.
63
New cards
Use the empirical rule to answer the following:
The score distribution for a certain standardized exam was N(17, 0.3). What percentage of students scored between 16.1 and 17.9?
\ 68%
95%
99\.7%
Unable to determine if using the empirical rule.
99\.7%
\ A score of 16.1 corresponds to a z of -3. A score of 17.9 corresponds to a z of +3. About 99.7% of observations lie between z=-3 and z=+3.
64
New cards
Use the empirical rule to answer the following:
The score distribution for a certain standardized exam was N(17, 0.3). What percentage of students scored \*below\* 17.6?
\ 2\.5%
95%
97\.5%
Unable to determine if using the empirical rule.
97\.5%
\ 17\.6 corresponds to a z-score of +2. 95% of observations lie between z=-2 and z=+2. This means that the remaining 5% of observations lie outside that range. Splitting: 2.5% lie below z=-2, and 2.5% lie above z=+2. So 95% + 2.5% = 97.5% of observations lie below z=+2.
65
New cards
Use the empirical rule to answer the following:
What percentage of observations lie above z=+3?
\ 0\.15%
99\.7%
99\.85%
None of the above
0\.15%
\ Since 99.7% of observations lie between z=-3 and z=+3, the remaining 0.3% of observations lie outside that range (below z=-3 or above z=+3). This means that half of 0.3% (i.e. 0.15%) lie above z=+3.
66
New cards
Use the empirical rule to answer the following: The score distribution for a certain standardized exam was N(17, 0.3). What percentage of students scored between 16.4 and 17.3?
\ 81\.5%
84%
95%
Unable to determine by using the empirical rule.
81\.5%
\ This corresponds to the range between z=-2 and z=+1. Between -2 and +2 lies 95% of observations, so from -2 to 0 is half of that: 47.5%. Between z=0 and z=+1 lie 34% (i.e. half of 68%) of observations. Adding the lower 47.5% to the 34% gives us 81.5%.
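The empirical rule is an approximation, so it can be instructive to compare it against the exact area under the Normal curve. The Python sketch below does both for this card. It assumes `statistics.NormalDist().cdf()` as a stand-in for R's `pnorm()`. The variable names are illustrative.

```python
from statistics import NormalDist

scores = NormalDist(mu=17, sigma=0.3)   # N(17, 0.3)
low, high = 16.4, 17.3                  # z = -2 and z = +1

# Empirical rule: half of 95% (z=-2 to 0) plus half of 68% (z=0 to +1).
empirical_pct = 95 / 2 + 68 / 2

# Exact area between the two scores: area left of high minus area left of low.
exact_share = scores.cdf(high) - scores.cdf(low)

print(empirical_pct, round(exact_share * 100, 1))
```

The two answers differ by a fraction of a percentage point, which is the price of using the rounded 68/95/99.7 figures.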
67
New cards
What is the name given to the statistical plot that can guarantee that our distribution is Normal?
\ The Normal Quantile Plot
The Normal Probability Plot
The "I can Guarantee that this Distribution is Normal" plot.
There is no statistical tool/plot that can make such a guarantee.
There is no statistical tool/plot that can make such a guarantee.
\ Remember that while the normal quantile plot SUPPORTS the idea of normality, it does not guarantee it.
68
New cards
You plot a dataset on a normal quantile plot and find the result seen here. It does indeed appear to be a straight line. What can you conclude?
\ You are certain the data is Normal.
You are certain the data is not Normal.
You are encouraged that your data follows a normal distribution. But you are also aware that you can't be 100% certain.
You are encouraged that your data follows a normal distribution. But you are also aware that you can't be 100% certain.
\ A straight line is suggestive of normality, but does not guarantee it.
69
New cards
You plot a dataset on a normal quantile plot and find the result seen here. The line is clearly curved. What can you conclude?
\ You are certain the data is Normal.
You are certain the data is not normally distributed.
You are encouraged that your data follows a normal distribution. But you are also aware that you can't be 100% certain.
None of the above.
You are certain the data is not normally distributed.
70
New cards
You are attempting to describe the "strength" of a relationship between two variables. Of the following plots, which relationship do you think has the highest strength?
\ The one labeled r=0
The one labeled r=0.4
The one labeled r=0.9
None of the above appear to have a strong relationship.
The one labeled r=0.9
\ When the dots are grouped closely together along a straight line, the relationship can be considered strong.
71
New cards
On a scatterplot, the explanatory variable is typically plotted on which axis?
\ x-axis
y-axis
Either one - it depends on the context.
x-axis
\ The explanatory variable should be plotted on the x-axis.
72
New cards
You are trying to examine a relationship between the amount of soda (or 'pop' if you are from the midwest!) consumed, and weight gain. In particular, you want to see if the amount of soda a person drinks results in weight gain. Which do you think should be considered the explanatory variable, and which the response?
\ Explanatory: Soda Consumed, Response: Weight gain
Explanatory: Weight Gain, Response: Soda Consumed
Explanatory: Soda Consumed, Response: Weight gain
\ The variable that "explains" or describes the result should be considered the explanatory variable. In this case, because we suspect that soda consumption causes weight gain (as opposed to weight gain "predicting" soda consumption!), we should make soda our explanatory variable.
In the scatterplot shown here, would you consider the datapoint at the top right to be an outlier?
\ Yes because it is far off from the other observations on the x-axis.
Yes because it is far off from the other observations on the y-axis.
No because while it is far from the other observations, it is still close to the regression line.
None of the above are correct statements.
No because while it is far from the other observations, it is still close to the regression line.
\ As long as a datapoint is close to the regression line, it still follows the 'pattern' of the other observations, and therefore, should not be considered an outlier.
75
New cards
In the image shown here, which of the four plots would be the best choice to display?
\ Top left
Top right
Bottom left
Bottom right
Bottom left
\ Ideally, a scatterplot should use scales that spread the data throughout the whole plotting area.
76
New cards
In the image shown here, most of the plots have very poorly chosen scales. Why can this be a problem?
\ Scales that are abnormally stretched or compressed can give very misleading impressions about our data.
Scales that are abnormally stretched or compressed usually make our regression line seem more linear than it really is.
Scales that are abnormally stretched or compressed usually make our strength seem weaker than it really is.
None of the above.
Scales that are abnormally stretched or compressed can give very misleading impressions about our data.
77
New cards
Given the options below, which would you estimate as the best value for r ?
\ r = 0.1
r = -0.1
r = 0.7
r = -0.7
r = 0.9
r = -0.9
r = -0.7
\ The direction is negative, hence the negative value for r. The correlation is neither very strong (close to -1) nor very weak (close to 0). So the best choice among those given is -0.7 .
78
New cards
Which of the following facts about r is FALSE?
\ r must be between -1 and +1.
r tells us BOTH the direction and strength of a relationship.
An r of +0.1 indicates a strong relationship.
An r of -0.9 indicates a strong relationship.
An r of +0.1 indicates a strong relationship.
\ A value of r close to 1 in magnitude, even if it is negative, indicates a strong relationship. The minus sign only tells you that the DIRECTION is negative. An r of +0.1 is close to 0, so it indicates a weak relationship.
79
New cards
Given the options below, which would you estimate as the best value for r ?
\ r = 0.1
r = 0.5
r = 0.9
It is not appropriate to use r in this situation.
It is not appropriate to use r in this situation.
\ This is a non-linear relationship, so calculating the correlation coefficient should not be done with this data.
80
New cards
What conclusions would you draw from this plot? (For those of you who have read ahead, you may assume there is also causation. If you don't know what I mean by this, you can ignore...)
\ If the number of powerboats is restricted (reduced), fewer manatees will die.
If we increase the number of manatees, say through a successful breeding program, more powerboats will be allowed.
If we increase the number of powerboat licenses, the manatee population will increase.
None of the above.
If the number of powerboats is restricted (reduced), fewer manatees will die.
\ I understand that some of the answers to this question are a bit awkward, but, unfortunately, this is often how things are discussed/explained in the real world.
81
New cards
How do we decide on the best regression line on a scatterplot?
\ By using a widely accepted mathematical model/formula such as the method-of-least squares.
By looking at the graph and carefully drawing the line that goes through most of the datapoints.
By looking at the graph and carefully drawing the line that has the fewest outliers.
All of the above are reasonable methods of drawing a regression line.
By using a widely accepted mathematical model/formula such as the method-of-least squares.
82
New cards
What is the chief problem with simply using a regression line as our model instead of going on to generate a regression formula?
\ Using the line is imprecise because different people may draw different lines.
Using the line requires us to estimate both the x and y values and is therefore imprecise.
There is no problem with sticking to the regression line! In fact, the regression line is better than the regression formula when it comes to making predictions
None of the above: The formula and the regression line are equally precise methods of predicting values.
Using the line requires us to estimate both the x and y values and is therefore imprecise.
\ The answer about different people drawing different lines is false: A proper line is NOT drawn by "estimating" it! A proper line is generated using a model such as method-of-least squares and would therefore look the same every time. The regression formula gives us a much more precise prediction than the line.
83
New cards
Suppose you are given the scatterplot shown here along with a properly generated regression line. What is the value of y^ for an x of 4?
\ About 2
Just under 4
Somewhere between 4 and 6
About 7
Somewhere between 4 and 6
\ Remember that the ^ character refers to a predicted value. The regression line is a tool we use for making predictions. So rather than look at any of the specific datapoints, we use the regression line.
84
New cards
In this scatterplot, given an IQ of 83, what would you predict as the grade point average?
\ 4
Can not be determined since there was no datapoint in the study that included an IQ of 83.
4
\ The regression line is a tool for making predictions. The original dataset is simply used to CREATE the regression model. Once the model has been created, we no longer have to worry about the original datasets and can simply use the line/formula to make predictions.
85
New cards
You are interested in the relationship between people's age and how much money they spend on eating in restaurants each month. You ask a sample of 200 people their age, and record the amount of money they spent eating out. Which, if any, of the following would cause you to decide AGAINST generating a regression model?
\ You plot the data and it is a straight line.
You calculate a value for r and it is 0.1 .
The mean value for age and mean for money spent in restaurants are vastly different.
None of the above are problematic.
You calculate a value for r and it is 0.1 .
\ If r is very low, then the relationship is weak, and a regression model would probably give very poor predictions.
86
New cards
Here is the general regression equation: y^ = b0 + b1\*x. Which character represents the explanatory variable?
\ y
y^
x
b0
b1
x
87
New cards
Here is the general regression equation: y^ = b0 + b1\*x. Which character represents the y-intercept?
\ y
y^
x
b0
b1
b0
88
New cards
Given: x-bar (mean of x) = 1, standard deviation of x = 2, y-bar (mean of y) = 3, standard deviation of y = 6, and r = 0.8. What is the value for b0?
\ \-6.2
0\.6
2\.7333
None of the above.
\
0\.6
\ This is direct plug-and-calculate... You just need to keep track of the variables!
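The plug-and-calculate steps can be sketched as follows, assuming the standard formulas b1 = r \* (s_y / s_x) and b0 = y-bar - b1 \* x-bar. The variable names are only for illustration.

```python
# Values given in the question.
x_bar, s_x = 1, 2
y_bar, s_y = 3, 6
r = 0.8

# Slope first: b1 = r * (s_y / s_x)
b1 = r * s_y / s_x

# Then the intercept: b0 = y_bar - b1 * x_bar
b0 = y_bar - b1 * x_bar

print(round(b1, 4), round(b0, 4))  # 2.4 0.6
```

The slope has to be found before the intercept, since b0 depends on b1.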
89
New cards
The graph and resulting model shown here was generated using appropriate statistical techniques. A person drinks 3.25 beers. What would you predict their BAC to be?
\ 0\.045
0\.050
\
0\.045
\ If you have a formula generated from a valid model, it will give you a more precise answer than trying to eyeball the answer from a regression line.
90
New cards
You are interested in the relationship between the number of hours students spend studying and their performance on their midterm statistics exams. You look at 3 students (n=3) and accurately record this information. You plot the information, and it does appear to be linear. You calculate r and see that it is very close to +1. You then generate the regression formula: score^ = 42 + # hours \* 8. What would you predict as their score if they studied for 4.5 hours?
\ 76
78
Unable to determine without knowing the exact value for r.
None of the above. The sample size is very small and as a result, we cannot have much confidence in this model.
None of the above. The sample size is very small and as a result, we cannot have much confidence in this model.
\ This model is linear and has a very high r -- this is usually a good thing! However, while you can certainly plug in numbers here, the very small sample size would (hopefully!) bother you and lead you to have very little confidence in this model. Remember that this course is about concepts, NOT about plugging numbers into formulas. It's very important to recognize when a formula or technique should NOT be used!
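For reference, here is what plugging the number in would look like. The function name `predict_score` is made up for illustration. The arithmetic itself is easy; the point of the card is that a result built on n=3 deserves very little trust.

```python
def predict_score(hours):
    """Apply the card's formula: score^ = 42 + hours * 8."""
    return 42 + hours * 8

n = 3  # sample size stated in the question
prediction = predict_score(4.5)

print(prediction)  # 78.0
if n < 30:  # 30 is only a rough, commonly quoted rule of thumb
    print("Sample is very small; do not put much confidence in this number.")
```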
91
New cards
This plot demonstrates the improvement in emotional quotient (EQ) over the years between a group of people from Mars and another group from Venus. Why was it important to distinguish these two categories on the plot?
\ One group shows a relatively strong relationship while the other does not.
One group shows a relatively linear relationship and the other does not.
Both groups demonstrate a linear relationship with fairly strong correlation, yet the regression models for the two groups are quite distinct.
None of the above.
Both groups demonstrate a linear relationship with fairly strong correlation, yet the regression models for the two groups are quite distinct.
\ In both cases, the relationships are strong. They are both clearly linear. Yet, the regression line would be very different for each of the two groups. If you had only one line, it would have to be somewhere in the middle, making it inaccurate for \*both\* groups!
92
New cards
True/False: The correlation coefficient is NOT resistant to (i.e. \*is\* affected by) outliers.
True
\ Because the calculation of r involves both mean and standard deviation of both x and y variables, it is affected by outliers - often significantly.
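A small sketch of how much a single outlier can move r. The data points and the helper name `pearson_r` are invented for illustration; the formula is the usual one for the correlation coefficient.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical, perfectly linear data.
xs = [1, 2, 3, 4, 5]
ys = [1, 2, 3, 4, 5]
r_clean = pearson_r(xs, ys)

# Same data plus one outlier at (6, 0).
r_outlier = pearson_r(xs + [6], ys + [0])

print(round(r_clean, 2), round(r_outlier, 2))  # 1.0 0.14
```

One extra point out of six is enough to take r from a perfect 1.0 down to roughly 0.14 in this made-up example.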
93
New cards
A relationship is hypothesized between hours studied and students' GPA. A study is done and the analysis shows a linear relationship with an r of 0.8 and an R^2 (r squared) of 0.64. What does the R^2 value tell us?
\ About 36% of students' GPA comes from factors other than the number of hours they study.
About 64% of students' GPA is determined by additional variables such as their age and experience in school.
About 80% (0.8) of students' GPA is determined by the number of hours they study.
About 36% of students' GPA comes from factors other than the number of hours they study.
\ R-squared, the coefficient of determination, attempts to quantify the degree to which the explanatory variable (e.g. number of hours studying) explains the response variable (e.g. GPA). The remaining percentage (1-R^2), then, must come from other variables.
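The arithmetic behind this answer is short enough to sketch. The function name `explained_and_unexplained` is made up for illustration.

```python
def explained_and_unexplained(r):
    """Return (R^2, 1 - R^2) for a given correlation coefficient r."""
    r_squared = r ** 2
    return r_squared, 1 - r_squared

explained, unexplained = explained_and_unexplained(0.8)
print(round(explained, 2), round(unexplained, 2))  # 0.64 0.36
```

So with r = 0.8, about 64% is attributed to hours studied and about 36% is left over for other factors.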
94
New cards
A linear relationship is found between number of beers and blood alcohol level. The value for r is 0.7. Clearly there are variables besides the number of beers that affect the BAC level (e.g. weight, race, gender, etc.). If you had to give a number, how strong a role do you think the number of beers plays in determining the BAC level?
\ 0\.09
0\.3
0\.49
0\.7
0\.49
\ This question is asking you for R-squared, which is simply the square of r.
95
New cards
You are interested in the relationship between number of beers and blood alcohol level. You collect data, and the plot appears to show a very strong linear relationship. What should you probably do NEXT?
\ Confirm that the data is normal by drawing a normal quantile plot.
Generate a regression model and use it to begin making predictions.
Support your idea that the data is linear by drawing a residual plot.
Ask for help.
Support your idea that the data is linear by drawing a residual plot.
96
New cards
The residual plot of a relationship is plotted as shown here. Which of the following can you determine from this plot?
\ The relationship is linear.
The relationship is not linear.
This relationship is normal.
The relationship is not normal.
The relationship is not linear.
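A residual is the observed y minus the predicted y^. When the residuals form a curved pattern instead of scattering randomly around zero, a straight line is usually the wrong model. The sketch below fits a line to deliberately curved, made-up data (y = x^2) to show what that pattern looks like in numbers. The plot from the card is not reproduced here.

```python
def least_squares(xs, ys):
    """Return (b0, b1) for the least-squares line y^ = b0 + b1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Hypothetical curved data: y = x squared.
xs = [0, 1, 2, 3, 4]
ys = [0, 1, 4, 9, 16]

b0, b1 = least_squares(xs, ys)

# Residual = observed y minus predicted y^.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(residuals)  # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The residuals are positive at both ends and negative in the middle: a U shape, which is the kind of pattern that points to a non-linear relationship.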
97
New cards
Examine the plot shown. If we were to remove the red dot, which of the following do you think would happen?
\ The regression line would more closely follow the blue dots.
r would get closer to -1.
Both of the above.
None of the above.
Both of the above.
98
New cards
How would you classify the red dot in this plot?
\ Influential
Outlier
Both outlier and influential.
Neither outlier nor influential.
Outlier
99
New cards
This image shows a scatterplot with a single influential point. What effect, if any, does this point have on the regression line?
\ It dramatically increases r.
Because it is only one dot, it does not have any real effect on the regression line.
It exerts a disproportional pull on the line upwards towards itself.
None of the above.
It exerts a disproportional pull on the line upwards towards itself.
\ An influential point, because it is far off on the x-axis from the other points, exerts a relatively strong pull on the line towards itself. While this could, in theory, increase r, you really can't say. More often than not, it will probably decrease r. The key point is that this single influential point will in some way AFFECT r - and likely make r a LESS accurate description of the rest of the data.
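The pull on the line can be sketched with made-up numbers. Five points with no upward trend give a flat line; adding one point far out on the x-axis tilts the whole line up towards it. The data and the helper name `slope` are for illustration only.

```python
def slope(xs, ys):
    """Least-squares slope b1 for paired lists xs and ys."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical cluster of points with no overall trend.
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 2, 3, 2]
slope_without = slope(xs, ys)

# Add one point far to the right and high up: (15, 10).
slope_with = slope(xs + [15], ys + [10])

print(round(slope_without, 2), round(slope_with, 2))
```

In this example the slope goes from essentially zero to a clearly positive value because of a single point.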
100
New cards
The plot shown here shows the relationship between temperature and time (as we head into late fall and winter). As you can see, the relationship is linear and with a reasonably strong correlation. What would you predict the temperature to be for the week of November 21st (11/21)?
\ At the y-intercept, that is, about 49 degrees.
A little below 49 degrees, say, about 47.
Unable to determine as the dataset range does not include this period.
Unable to determine as the dataset range does not include this period.
\ Even though the line has been drawn out, it goes well beyond the range of the observations that were used to create the model. This is called extrapolation and should be avoided. I realize this may seem like a trick question - but this is not my objective here! I admit to having fallen for this kind of thing myself before. It's something we all need to be vigilant about!
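One way to stay vigilant is to make the prediction step refuse any x outside the range of the original observations. This is only a sketch: the function name `predict_within_range`, the coefficients, and the week range are all hypothetical and are not read off the plot.

```python
def predict_within_range(x, b0, b1, x_min, x_max):
    """Predict y^ = b0 + b1*x, but only inside the observed x range."""
    if not (x_min <= x <= x_max):
        raise ValueError(
            f"x={x} is outside the observed range [{x_min}, {x_max}]; "
            "predicting here would be extrapolation."
        )
    return b0 + b1 * x

# Hypothetical model: temperature falls 2 degrees per week,
# with observations covering weeks 1 through 8.
b0, b1 = 70.0, -2.0
x_min, x_max = 1, 8

print(predict_within_range(4, b0, b1, x_min, x_max))  # 62.0

try:
    predict_within_range(12, b0, b1, x_min, x_max)
except ValueError as err:
    print(err)
```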