biostats2010 week 4a part 2

Transcript

All right. Hope you're all doing well and having a good week. Today, we're just going to press on with the normal distribution.

We're going to start off with a little bit of review.

And then it's going to be mainly practice today with the calculator or the calculator app, whichever you're choosing.

So we'll get that handy, go ahead and get that out.

I'm just going to pretty much do practice problems for most of the class today.

And in case you missed Monday, we don't have class this Friday.

I'm going to be traveling Friday. And I'm going to move the homework that was due Friday to Monday.

So if you all can ask questions Monday, we'll be back.

I know some people have last minute questions, so since we're skipping a Friday, I'd like to move that homework to Monday.

So we can calculate three different types of probabilities in the normal distribution, right? All we're doing is calculating areas under the curve. We can calculate a left tail probability, a cumulative left tail probability.

You can calculate a right tail probability, also a cumulative probability.

Then we can calculate a probability in between two things, right?

We're mainly working with the table on Monday. The table only gives us which one, the left tail or the right tail?

Exactly. So the table only gives us the left tail probabilities.

So in order to get a right tail one, we have to do some extra work.

We restate it in terms of a left tail probability, or we can do that using symmetry.

And then same thing goes when we're trying to find probability in between two things.

We just have to take the difference between the larger one and the smaller one.

So those were the three probabilities we could do in the normal distribution, but there was one other thing we could do as well, and that was to actually find a value in the data set that corresponds to a certain probability.

So this is, I think, where we left off with this class.

So we were looking at serum cholesterol levels. We have a mean, we have a standard deviation, and under this mean and standard deviation in the normal distribution we're asked: what level of serum cholesterol is exceeded by only 10% of the population?

So what is this even asking? What kind of things should we consider here? Is this asking for probability or is it asking for something different?

So is this asking for a probability or is it for a value?

Value. Yep, because that's for a value in the actual dataset.

So essentially, like I said when you're starting out with these, it's always good to draw things out.

So essentially what it's asking for is: what's this little mark off here, right?

What value cuts off the 10% of the data that is above a certain line?

So essentially, the first thing we're going to look for is the z critical value that corresponds to a right tail probability of 0.1.

So how would we first go about doing that? If you find the inverse with the table, remember how we did that.

What's up? So the complement does come into play here because it is a right-hand probability.

So we have to find the complement of this first, right, and then what do we do after that to get the actual z critical value?

We go to the table, right? The center of the table has the probabilities and the margins have the z critical values, the z-scores.

So we go in, find the value, and we go out to the margins to get that z-score.

So you're right, the complement is the first thing. So we want to restate it in terms of left tail probabilities.

So we want to know this z critical value that corresponds to a left tail probability of 0.9.

If we look that up on the table, we're going to get 1.28.

So we find the value that rounds closest to 0.9, and we go over here: 1.28. Because on the left side are the ones and tenths, and on the top are the hundredths. We could also do it using symmetry, but that way is a little confusing and involves a little more algebra.

So I'm not actually going to talk through that way.

I don't want to confuse anybody. But now we have our z critical value that corresponds to that left tail probability.

Are we finished yet, though? What do we need to do now? What did the question ask for? It asks for what level of serum cholesterol is exceeded by only 10% of the population.

So whatever our answer is going to be, we know it's going to be greater than 219, right?

This 219 is in the center there. So we have 1.28, right? But are we finished? What do we need to do now? Is this 1.28 a value in the actual data? Nope. What is 1.28? That's just a z score, right? So how do we get from a data point to a z score? We use that transformation, right? We have to do x minus the mean over sigma. So how can we get this back to an actual point in the data set?

We call up our old friend algebra to convert it back to a point in the data set.

So we have a z-score. We need to convert it back to our original random variable x.

Well, this was our formula: z equals x minus mu over sigma. So we can just rearrange this. We can multiply both sides by sigma, and we'll have x minus mu over here.

We rearrange things, move the mu over, and we get x equals z times sigma plus mu.

So we have our z value, 1.28, we have our sigma value, which is 50, and then we have our mean, which is 219, and we get 283.

So now we can say the serum cholesterol level exceeded by only 10% of the population is about 283.

So we found that part in the distribution. Any questions about that? And big reveal today, I'm going to show you how to do it with a calculator, so it's going to take far less steps.
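That last conversion is just arithmetic, so here is the by-hand method sketched in Python as a scratchpad check (this is not part of the course's calculator workflow, just the same algebra written out):

```python
# z critical value for a left tail area of 0.9, read from the z table
z = 1.28
mu, sigma = 219, 50  # serum cholesterol mean and standard deviation

# rearranged z-score formula: x = z * sigma + mu
x = z * sigma + mu
print(x)  # 283.0
```

Same 283 we got on paper: 1.28 standard deviations of 50 above the mean of 219.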

All right, so next one, we're going to work through these.

So on average, a person gets 6.8 hours of sleep per night.

Assume that the standard deviation is 0.6 hours and that the number of hours slept per night is normally distributed.

What is the probability that a randomly selected person sleeps between 7 and 8 hours?

So we'll start off with that one. Trying to find probability between two values, right?

So if we drew this out, would our shaded region be to the right or to the left of the mean?

To the right, yep. So where's the mean? 6.8, right? We want to know between 7 and 8, so that shaded region is going to be to the right of the mean. We're finding the probability of being in between two things, right?

The probability of being in between 7 and 8, where are our parameters, what's our mean, what's our standard deviation?

Mean of 6.8, standard deviation of 0.6, right? So here's what the picture would look like. Let's all get our calculators out and I will show you how we can do this very quickly and easily with a calculator.

So what we're going to do now is go back to 2nd VARS, the distribution menu, and we're going to use normalcdf.

So it's going to be that second option in 2nd VARS, normalcdf.

And that's just the normal distribution cumulative probability calculator, simply.

And we have three options here for probabilities.

We have, and this is where it gets maybe a little confusing, a left tail probability, right?

We have a right tail probability, and then we have finding the probability in between two.

With the left tail probability, or really with all of these, it's going to ask you for a lower bound, an upper bound, and then your mean and your standard deviation.

So, for the lower bound, you're always going to put in negative infinity.

So you're going to put negative sign and a whole bunch of nines.

That is if you're finding a left tail, right? If your x value is on top and you're finding the shaded region to the left of that, your lower bound will always be negative infinity.

If you're finding a right tail probability, your lower bound is going to be your x value and then you're going to put in positive infinity for the second value.

Mu and sigma will always be the same. And if you're finding the probability in between two things, you just put in the lower bound as x1 and the upper bound as x2.

So what we're trying to do: we're trying to find the probability of being in between 7 and 8, where the mean is 6.8 and the sigma is 0.6.

So which one of these three are we going to use in this situation?

So we've got, we're trying to find this, right? Which one of these three are we going to use for that?

The last one. Yep, we'll use the last one. And what do we put in as our x1 value? Seven. Seven. What do we put in as our x2? Eight. Yep, and then what about our mean and standard deviation?

6.8 and 0.6. Yep, so let's all do that. And what do we get? Here we go. So 7, 8, 6.8, and 0.6. What are we coming up with? I don't have it written down, but what I have here is 0.34.

Is that what other people are getting? Matches. Okay. So a couple things are happening here. A couple things aren't happening. So when you do it with the calculator, you get to skip all that z-score stuff.
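As a cross-check on that normalcdf(7, 8, 6.8, 0.6) call, a between-two-values probability is just the difference of two left tail probabilities. A stdlib-only sketch (the `norm_cdf` helper is mine, not a course function):

```python
from math import erf, sqrt

def norm_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative left tail probability of a normal distribution."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# P(7 < X < 8) with mean 6.8 and standard deviation 0.6:
# difference between the larger left tail and the smaller one
p = norm_cdf(8, 6.8, 0.6) - norm_cdf(7, 6.8, 0.6)
print(round(p, 4))  # 0.3467, the 0.34... from class
```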

You don't have to standardize anything. You don't have to go to the table, obviously. A lot of people kind of get tripped up on that. So if you're going to do it with the calculator, just like with the other distributions, this slide is your friend, right?

As long as you can get the problem in one of these forms, right?

Whatever it's asking you for, get it into one of these forms; you just need to stay organized on where you type things into the calculator.

The other thing that kind of trips people up here is those upper and lower bounds.

So if you're trying to find a left tail, once again, your lower bound is always going to be negative infinity: a negative sign and a whole bunch of 9s.

If you're trying to find an upper tail probability, your upper bound is always going to be a positive thing.

So just a whole bunch of nines. All right, here's another kind of thing that you can use the calculator for.

So the second question using this example: only 33% of people get less than what amount of sleep per night?

So what is this actually asking? Is it asking for a probability or a value? Yep, it's asking for a value. It's asking for the value that corresponds to this left tail probability in this distribution.

So if we drew this out, we're kind of looking for what is this value in the data set that corresponds to this.

So if we were going to do this by hand, we would have to do that inverse thing, where we go to the probability in the table, go out to the margins, and do algebra to convert back into something that fits within the problem.

But luckily for us, we have a calculator function that can do all this work.

So let's go back to 2nd VARS, and we're going to scroll down to our third option, which is invNorm, inverse norm.

Any calculator issues by the way? Anybody have any issues with the app, locating these functions?

If so, please let me know. This will really speed up your workflow here in this class.

So when you do the inverse norm, it asks just for three different things.

It asks for the area. Area just stands for the probability or the percentage that you're given.

Sometimes you might have to convert a percent to a decimal, because it's going to want it in decimal form.

And then it's just going to ask you for your mean and standard deviation.

If you're looking for a value that corresponds to a left tail probability, you use it as-is.

If you're looking for a value that corresponds to a right tail probability, you have to take a complement first for that area value.

So it's just something to keep in mind, a little detail that trips people up. So for the problem we're working on, which one of these are we going to use, the top one or the bottom? Yep, you got it, the top one.

So what are we going to type into the calculator? We'll have inverse norm, and what will our first value be that we put in?

The .33. You got it. The area or the probability of .33. And then we put in our mean and standard deviation.

And what's the value we get? Anybody come up with something? 6.54. You got it. So we use that inverse norm and we get 6.54. That is the value that marks off the lower 33% of the distribution.
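The invNorm step can also be sketched in code. Since the stdlib has no inverse normal, this sketch inverts the CDF by bisection; the `inv_norm` helper name is mine, chosen to mirror the calculator function:

```python
from math import erf, sqrt

def norm_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

def inv_norm(area, mu=0.0, sigma=1.0):
    """Value with the given left tail area, like the calculator's invNorm."""
    lo, hi = mu - 10 * sigma, mu + 10 * sigma
    for _ in range(100):           # bisection works because the CDF is increasing
        mid = (lo + hi) / 2
        if norm_cdf(mid, mu, sigma) < area:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# amount of sleep that only 33% of people get less than
x = inv_norm(0.33, 6.8, 0.6)
print(round(x, 2))  # 6.54
```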

We'll see this one too. So this is the one we did by hand with the table earlier.

So we were working with serum cholesterol levels, and we had a mean of 219 and a sigma of 50.

And we were asked what serum cholesterol level is exceeded by only 10% of the population.

So same type of problem, right? We're looking for that value. But what are we going to type in the calculator for this one?

What's our area going to be? Not 0.1 by itself, and that's okay. So this one's asking for a value that corresponds to a right tail probability.

So we're going to do some extra work on the front end.

What are we going to type in for the area? 1 minus 0.33? 1 minus 0.1 for this one. So 1 minus 0.1, our area is going to be 0.9. And then we just put in the mean and the standard deviation.

So 219 and 50. And what have we come up with? It should be the same as what we came up with when we did it by hand.

We're still getting 283. All right. We're going to do some more practice now, and I'm going to let you all solve this one sort of on your own or in pairs.
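The right tail version of the inverse problem works the same way once you take the complement first. A sketch reusing the same bisection idea (helper names are mine, not calculator syntax):

```python
from math import erf, sqrt

def norm_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

def inv_norm(area, mu=0.0, sigma=1.0):
    """Value with the given LEFT tail area (bisection on the CDF)."""
    lo, hi = mu - 10 * sigma, mu + 10 * sigma
    for _ in range(100):
        mid = (lo + hi) / 2
        if norm_cdf(mid, mu, sigma) < area:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# right tail of 0.10 -> take the complement first: left tail area 0.90
x = inv_norm(1 - 0.10, 219, 50)
print(round(x))  # 283, matching the by-hand answer
```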

You can do it however you want. I'm going to give you kind of a minute. I will read it out first, though. So some diabetes researchers have observed that weight gain during adolescence among diabetic patients is affected by the level of compliance with insulin therapy.

Suppose 12-year-old boys with type 1 diabetes who comply with their insulin shots have weight gain over one year that is normally distributed with mean 12 pounds and standard deviation 12 pounds. So that's our problem.

So what is the probability that a compliant type 1 diabetic 12 year old boy will gain at least 15 pounds over one year?

At least 15 pounds. So let's all work this out; hopefully you can do it with the calculator, because that would be much quicker.

Like I said, you can work with something next to you if you want.

I'll give you a minute to draw this one out, and then we'll come back together.

Keep in mind with these cumulative ones: if you're looking for a left tail, your lower bound is negative infinity.

If you're looking for a right tail, your upper bound is positive infinity.

So just a whole bunch of nines, either positive or negative.

All right. First things first with these, with any of these probability distribution problems:

What's our first step before we even calculate anything?

Write everything out. What was that? Write everything out. Yeah, we need to write everything out. We need to write out what we're solving. So what are we trying to solve here? The probability that x is less than, greater than, or in between something?

What is it? Greater than or equal to 15. You got it: greater than or equal to 15. And we have a mean and a standard deviation that happen to be the same number here.

So we use the normalcdf function in the form that corresponds to finding a greater-than probability.

So we're going to put in 15 here, positive infinity, 12, 12.

And what do we get for our answer? You got it, 0.401. So just plug the stuff in and you get the answer. Any questions about that? Next: only 20% of compliant type 1 diabetic 12-year-olds will gain more than how much weight?
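The at-least-15 calculation is a right tail, which is one minus a left tail. A quick stdlib sketch of that normalcdf(15, 999999, 12, 12) call (`norm_cdf` is my helper, not a course function):

```python
from math import erf, sqrt

def norm_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# P(X >= 15) with mean 12 and standard deviation 12:
# right tail = 1 - left tail
p = 1 - norm_cdf(15, 12, 12)
print(round(p, 3))  # 0.401
```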

Funny way to word it, but only 20% of compliant type 1 diabetic 12-year-olds will gain more than how much weight?

So is this asking for a value, or is this asking for a probability?

Yep, it's asking for a value and it's giving us an upper tail probability that corresponds to that value.

So, in other words, we're asking what value in the data marks off that top 20%.

So which calculator function are we going to use here?

So are we going to use a normal CDF? Are we going to use the other one? We're going to use the other one and that's the inverse norm.

So with the inverse norm, we have a few options. If we're given a left-tail area, we can just type it in P, mu, and sigma.

If given a right tail area, we take the complement.

So what are we going to have to do with this one? We're going to have to take the compliment, right?

So we'll do 1 minus 0.2. So we'll put 0.8 in for our area, and then we'll put in mu and sigma.

So what are we getting when we do that? And once again, that's 2nd VARS, the invNorm function.

So it should be 0.8, 12, and 12. What do we get? About 22. You got it, that's right. Everybody else coming up with something along those lines?
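And the same bisection sketch confirms that invNorm(0.8, 12, 12) call (again, `inv_norm` is my stand-in for the calculator function):

```python
from math import erf, sqrt

def norm_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

def inv_norm(area, mu=0.0, sigma=1.0):
    """Value with the given left tail area (bisection on the CDF)."""
    lo, hi = mu - 10 * sigma, mu + 10 * sigma
    for _ in range(100):
        mid = (lo + hi) / 2
        if norm_cdf(mid, mu, sigma) < area:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# top 20% -> complement -> left tail area 0.8
x = inv_norm(0.8, 12, 12)
print(round(x, 1))  # 22.1 pounds
```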

All right, another example here. Anybody have any questions about that last one? Any calculator issues?

Assume that the readings on scientific thermometers are normally distributed with a mean of 0 degrees Celsius and a standard deviation of 1 degree Celsius.

Thermometer is randomly selected and tested. In each of the following cases, draw a sketch and find the probability of each reading in degree Celsius.

All right, so less than 2.75. I'll let you go ahead and work that out with the calculator.

This is sort of a drawing of what it's going to look like what we're doing.

We're going to be way above the mean here, right? Because the mean's at zero and we want less than 2.75. You can sort of imagine that it's going to be quite a bit of the area of the whole curve.

But I'll let you figure out what you're supposed to type in the calculator here, And then we'll report back.

So with this one, since we're finding a left tail probability, we're just going to use the normalcdf form that corresponds to that.

So what's our lower bound going to be, and what's our upper bound going to be?

Negative infinity and 2.75. You got it: negative infinity and 2.75. And what's our mean and our standard deviation? Zero and one. You got it. So hopefully everybody was able to punch this in and get something that looks like this.

Like I said, if you draw it out, you can already kind of imagine, oh, it's going to take up most of the curve, because 2.75 is so far above zero.

You can imagine it will be most all of the shaded region. Any questions about this stuff, about how to apply it, how to calculate these things?
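One more cross-check: for the standard normal, normalcdf(-999999, 2.75, 0, 1) is just the left tail CDF at 2.75 (the `norm_cdf` helper is my sketch, not a calculator command):

```python
from math import erf, sqrt

def norm_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# P(Z < 2.75) on the standard normal: mean 0, standard deviation 1
p = norm_cdf(2.75, 0, 1)
print(round(p, 3))  # 0.997, i.e. almost the whole curve
```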

Like I said, I know that we've got a few weeks until the next exam, but the exam will include a lot of these calculations.

So, you know, being able to read a problem, be like, oh, which distribution is this?

And then be like, okay, what is it asking for? Left tail, right tail, in between, inverse, and then solve it from there.

So most of the exam is applied like that. All right, so for the end of this lecture, we're going to take a little theory detour here.

So before applying methods for the normal distribution, we have to make sure our variable actually conforms to the normal distribution, or at least approximately conforms to it.

And we're just going to talk about a few different ways.

The most simple version is looking at a histogram, right?

And we've already done that earlier in the semester.

How can we tell if a variable is normally distributed by looking at a histogram?

The histogram gives us the distribution, right? And if the distribution is roughly symmetrical and bell-shaped, we can say it's probably normally distributed, right?

There's also another plot we can look at that sounds kind of crazy, but it's easier to interpret than you think.

You can look at it with what's called a normal quantile plot.

I'll sometimes refer to it as a QQ plot. So we're going to talk for a few minutes about these.

What they do is they plot observed values against expected normal Z scores.

So these expected normal Z scores are the Z scores you would expect to see after standardizing the observations if the data were actually normally distributed.

That sounds kind of insane, but like I said, it's easier to interpret than you'd think.

Histogram: straightforward. If normally distributed, there should be a symmetrical bell shape.

If not normally distributed, you should see some kind of skewness going to the right or left, or outliers present.

If you look at a QQ plot and things are normal, everything's going to fall on a nice diagonal.

And you'll see in a minute here, if you have deviations from normality, that's when the points will sort of fall off the diagonal.
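The mechanics behind a QQ plot can be sketched without a plotting library: sort the data, and for the i-th of n points compute the z score you'd expect at that rank if the data were normal. This sketch uses the (i - 0.5)/n plotting-position convention; the helpers, function names, and the five sample values are mine, and software packages differ slightly in the convention they use:

```python
from math import erf, sqrt

def norm_cdf(x):
    return 0.5 * (1 + erf(x / sqrt(2)))

def inv_norm(area):
    """Standard normal quantile via bisection."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if norm_cdf(mid) < area:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def qq_pairs(data):
    """(expected normal z score, observed value) pairs for a QQ plot."""
    xs = sorted(data)
    n = len(xs)
    # (i - 0.5)/n plotting positions; i runs 1..n, so (i + 0.5)/n with 0-based i
    return [(inv_norm((i + 0.5) / n), x) for i, x in enumerate(xs)]

pairs = qq_pairs([4.1, 5.0, 5.2, 5.5, 6.3])
print(pairs[2])  # middle observation pairs with an expected z score of about 0
```

If the points (expected z, observed value) fall on a straight line, the data conform to the normal distribution; curvature or stray tails are the deviations from normality described above.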

So here is some data. We have the histogram for this data. And then we also have the QQ plot that corresponds to the data.

So how would you describe this histogram? Is it symmetrical or is it skewed? Yes, it's slightly skewed to the right.

It's not terrible, but it's slightly skewed to the right.

It's starting to look somewhat symmetrical here, but then you see some skewness out here.

And you can see in the corresponding QQ plot where the data falls on the diagonal.

That's where it's conforming more so to the normal distribution.

But then here on the tails here, you see that it's deviating from normality.

So this one is not terrible, but it's not great. Here we have some systolic blood pressure. What do we think about this data? You said it's pretty normal. Yeah, this is about as good as it gets. So we have this really nice histogram, evenly distributed, follows that bell curve, and then on our QQ plot, we have everything pretty much exactly on this diagonal.

You do have a few deviations here at the bottom and top, but things are looking pretty good. And what do we think about this one? Yes, quite a bit of skewness in this one. So say you plot this data for your boss or the hospital system you're working with, you're crunching data for them, and they give you this.

What kind of measure of center are you going to want to express if you're reporting back on this data?

Median. Excellent. Yeah. And what measure of variation do we want to express as well?

So our resistant statistic for the center, resistant to outliers, is the median. Excellent.

And then our resistant statistic for variability was the interquartile range.

So a lot of studies that have data that's super skewed just express medians and their interquartile ranges.

Last little bit, and this is where people's heads start to spin. We had our probability mass function for the binomial distribution.

What were our two parameters in the binomial distribution?

N and P, right? N stands for what? Total number of trials and then P is what? So probability of success on any one trial, right?

The whole function is on the left side we have this binomial coefficient, right?

It tells us how many ways we can arrange these probabilities.

Then we have the probability of success raised to the number of successes, times the probability of failure raised to the number of failures.
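That formula is short enough to sketch directly. The function name is mine; the pieces are exactly the ones just described, the binomial coefficient times the two powers:

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial coefficient, times p^successes, times (1-p)^failures."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# e.g. probability of exactly 2 successes in 5 trials with p = 0.4
print(round(binom_pmf(2, 5, 0.4), 4))  # 0.3456
```

You can see why this gets painful as n grows: a right tail probability means summing this over hundreds or thousands of k values, which is where the normal approximation below comes in.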

So this one can become difficult as n gets large. And it turns out that as the number of observations n within the binomial distribution gets larger, probabilities we calculate from the binomial distribution will actually get closer and closer to what the normal distribution gives.

Because of that, sometimes we use the normal distribution to approximate the binomial distribution or other distributions.

So don't get too bogged down in all this. There's some kind of overarching things I want to let you know.

But when n is large, X can be approximately normal.

So what we can do is take what's going on in the binomial distribution, calculate N times P.

Remember N times P is that expectation, that mean value that we expect to see.

So if we had a probability of success of 0.4 and we had 100 trials, we could just multiply 100 times 0.4 to get how many successes we would expect.

So that would be, we'd sub that in as our mean in the normal distribution.

We're using normal distribution methods on this data.

And then we could sub this in for our standard deviation.

Don't stress too much about these two formulas. The main thing I want you to know is that you can do this when n is so big that n times p is greater than 10 and n times (1 minus p) is also greater than 10.

So really overall you can do this if you have a very high end, a very high number of trials, a very high sample size.

So the accuracy of this normal approximation improves as sample size increases.

It's going to be most accurate when P is close to 0.5 and least accurate when P is closer to 0 or closer to 1.

So when we have something that's more middle, it's going to do a better job at approximating the binomial distribution.

So let's look at this example and we'll finish with this.

So about 60% of American adults are either overweight or obese according to the US National Center for health statistics.

Suppose that we take a random sample of 2,500 adults. What is the probability that 1,520 or more of the sample are overweight or obese?

So why is this a binomial variable? Well, each individual is either overweight or obese, or they're not.

So that's where we get this 60% from. That's what they've counted in each individual.

So the number in the sample who are either overweight or obese is a random variable X, where n is 2,500, right, that's our total sample.

And the probability of success is 0.6. Once again, I know it doesn't sound like a "success," but that's just whatever thing we're counting.

So we said that approximating the binomial distribution with a normal distribution is going to be most accurate when our p is close to 0.5 and least accurate when it's near 0 or 1.

Also, we said the accuracy of the normal approximation increases as the sample size increases.

So what do we think? I think we're good to use the normal approximation here.

We have a high sample size. Is our p pretty close to 0.5? Closer to 0.5 than it is to one or zero. So we're good to go, right? How are we going to do this? Well, we can get our mean or expected value by doing n times p, 2500 times 0.6, so we get 1500. And then we can get our standard deviation.

So we can do n times p times 1 minus p, slap a square root on that, and we get 24.49.

So Now we have kind of converted this all into the normal distribution.

So now we're saying x is normally distributed on 1500 and 24.49.

And all we're doing is finding a right tail probability now.

That was our original question: what's the probability that 1,520 or more, right?

So we're doing a right tail probability. And we can do this with the table or the calculator, and we end up getting about 0.206, right?

And if we did this with normalcdf, we get something very similar.

And if you actually did this with the exact binomial, you would get about 0.21.

So these both line up when n gets really large and our probability of success is somewhere near 0.5.
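That agreement can be demonstrated directly. This sketch, using the example's numbers (n = 2500, p = 0.6, cutoff 1520), compares the normal approximation against the exact binomial tail; the exact sum is done in log space because p^k underflows a float for k this large:

```python
from math import comb, erf, exp, log, sqrt

n, p, cutoff = 2500, 0.6, 1520

# normal approximation: mu = n*p, sigma = sqrt(n*p*(1-p))
mu, sigma = n * p, sqrt(n * p * (1 - p))
approx = 1 - 0.5 * (1 + erf((cutoff - mu) / (sigma * sqrt(2))))

# exact binomial right tail: sum the pmf from the cutoff up to n,
# computing each term as exp(log pmf) to avoid floating-point underflow
exact = sum(exp(log(comb(n, k)) + k * log(p) + (n - k) * log(1 - p))
            for k in range(cutoff, n + 1))

print(round(approx, 3), round(exact, 3))  # roughly 0.207 and 0.213
```

The two answers differ by well under a percentage point, which is the whole point of the approximation; they would drift apart for small n or for p near 0 or 1.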

This isn't super meaningful to us right now in this class, but it will be when we get to doing some real statistical inference later in the class.

We will use this to do confidence intervals for our risk ratios and our odds ratios.

Do we have any questions about this? Great. Awesome.

I'll see you next Monday, and we'll pick it up from there.

 

 

biostats week 4a part 2

Slide 21 - Critical Thinking Opportunity Example:...

table only gives us left tail probabilities, so right tail needs extra work with symmetry or the complement

Slide 22 - Critical Thinking Opportunity Example:...

wording like "exceeded by only" means it's asking for a value

to find the z value, use the complement first and then read the table margins for the z critical value

Slide 23 - Critical Thinking Opportunity Example: Cholesterol

once you find the z score, rearrange z = (x - mean)/standard deviation to get x = z*(standard deviation) + mean

Slide 24 - Applications of Normal Distributions Example:...

mean= 6.8

between 7 and 8 so to the right of the mean

standard deviation= .6

X ~ N(6.8, 0.6), find P(7 < X < 8)

Slide 25 - P(7 < Z < 8)

Slide 26 - Calculating Normal Probabilities with a Calculator

Slide 27 - P(7 < Z < 8)

for the probabilities it's going to ask for lower and upper bounds. the lower bound is negative infinity for a left tail, the upper bound is infinity for a right tail, and for between you use the two x values as the bounds

plugging it into the calculator

Slide 30 - invNorm(p, μ, σ)

Slide 31 - X ~ N(6.8, 0.6)

Slide 32 - invNorm(p, μ, σ)

third option in 2nd VARS: "invNorm" (inverse norm)

use the top case (left tail) in this scenario

Slide 33 - P(Z > a) = 0.10

Slide 32 - invNorm(p, μ, σ)

p=.33 (the area, i.e. the probability)

Slide 33 - P(Z > a) = 0.10

Slide 34

mean= 12

standard deviation= 12

X ~ N(12, 12), find P(X > 15)

Slide 35 - 20%

for this, use invNorm with the complement, since 20% is an upper tail area: 1 - 0.2 = 0.8

Slide 36 - normalcdf(-999999, 2.75, 0, 1) = 0.997

in this one, the cutoff 2.75 is well above the mean of 0, and we want P(X < 2.75), a left tail

negative infinity and 2.75 as parameters

x=2.75

Slide 37 - Assessing Normality of a Distribution Before...

Slide 38 - • Q - Q plots graph observed values against...

q-q plots graph observed values against expected normal z scores

Slide 39 - Assessing Normality of a Distribution

when finding the probability with standard deviation and mean, USE x ON THE UPPER BOUND WHEN FINDING LESS THAN AND x ON THE LOWER BOUND WHEN FINDING MORE THAN

Slide 40 - Q - Q Plots

Slide 41 - Q - Q Plots

Slide 42 - Q - Q Plots

Slide 43 - Back to the Binomial Distribution Recall the...

n= total number of trials p= probability of success on any 1 trial

Slide 44 - Normal Approximation of the Binomial...

Slide 45 - Normal Approximation of the Binomial...

Slide 46 - Normal Approximation of the Binomial...

each person is either overweight or not, making it a binomial example

 
