Unit 1

what's up my stat stars welcome to the AP Statistics unit 1 summary video in this

video we're going to go all the way through unit 1 exploring one variable data talking about all the major themes

and all the major Concepts to make sure that you are ready either for your unit one test or to help prepare you for the

AP test in may now before we begin I want to mention two really really important things first this is just a

review video we're not going to cover every single teeny tiny Topic in extreme detail that's what a class was for the

purpose of this video is to take everything that your teacher threw into the last couple weeks and put it into one digestible video that kind of covers

the big major themes of it all now if you are looking for much more specific videos that cover every single topic in

unit 1 and all the other units of AP Statistics please check out my YouTube channel I got videos for every single

topic explaining everything in much more detail than this review video or if you're really looking for a lot of great

information that can help you prepare for your unit test or the AP test please check out the ultimate review packet

using the link in the description at the ultimate review packet you can get a free trial to take a look at every

single unit you get study guides practice sheets practice multiple choice and you also get these awesome review

videos as well and the best part is you even get answer keys to study guides and

those practice sheets to make sure that you're doing everything okay at the very very end you can even do a full length

practice AP exam and the second thing I want to mention is Yes you heard me right study guide while you're at the

ultimate review packet please make sure to download my study guide for unit one I also got study guides for all the

other units and you can use that study guide while you watch this video you could pause fill Parts in hit play pause

fill some more parts in or you can watch the whole video and fill it all out at the end but the best thing is you got

access to that answer key so you can check out all the answers at the end and make sure that you're doing everything okay and if you even want more practice

to prepare you for the exam you can also check out my practice sheets alright let's get into it

unit 1 is all about exploring one variable data

we're really going to learn how to analyze one variable or how to take one variable and compare it across multiple

samples or multiple groups now listen understanding how to analyze data is super important it may seem kind of boring

and not that fun at the beginning but what we need to do with analyzing data later on in statistics is so crucial to

the really big important concepts that are probably going to be the most challenging for you so if you understand how to analyze data now it's going

to pay off big time at the end when we do some really important stuff now listen this unit is really broken down

into two things categorical data and quantitative data and I'm not going to lie to you categorical data is way

easier way faster way shorter in fact only a small percentage of this entire unit is even about categorical data much

much bigger part of the unit is over quantitative data but regardless of categorical or quantitative variables there's something really important that

you need to understand anytime you select a sample and from that sample you collect data any summary information

that you learn from that sample data is called a statistic whereas if you

collect information if you collect data from an entire population then anything you learn from that population is called

a parameter it's really easy to memorize these things because it basically here's the idea statistics starts with an S and

so does samples and statistics come from samples parameter starts with the p and parameters come from populations which

also starts with a p so it's pretty easy to remember that concept now we collect data from individuals and individuals

can be well honestly anything it can be a person it could be a chair it can be a tree can be a lake it could be a state

it could be a country it could be a day for all that matters really an individual can be anything now here's

the most important part a variable is any characteristic that can change from

one individual to another so if you just think about a person or maybe multiple people think about any characteristic

that can change from one to another eye color hair color weight height just to name a few now the reason why we like

analyzing data so much is because individuals vary if individuals didn't vary well then we wouldn't even

need this course and the world would be a pretty boring place now here's the deal with variables there's only two

types all variables in the world can be categorized into two types either categorical variables or quantitative

variables a categorical variable takes on values that are category names or

group labels like eye color or hair color whereas a quantitative variable takes on numerical values that are

either measured or counted like the weight of a frog or how many candies are in a bag to try to keep it really simple

a categorical variable value is simply going to be a word whereas a quantitative variable value is typically

going to be a number now there are a couple exceptions to that rule namely zip code ZIP code is a number but it's

not measured and it's not counted so that doesn't make it quantitative a zip code is simply a number that tells your mail

where to go which means it simply puts your mail into a specific category for your City's post office so that's why

zip code is one of those weird exceptions that's a number but technically a categorical variable but to be honest in most cases it's pretty

straightforward categorical variables are words quantitative variables are numbers let's start off with categorical

data because it really is shorter and much faster to talk about there's just not a whole lot there now let's say that

we take a sample of 89 lemurs and one of the variables that we want to analyze from those lemurs is the type of lemur

it is whether it's a sifaka an aye-aye a ringtail or a mouse lemur I'm probably pronouncing some of those wrong but

again those are all words which makes this a categorical variable now if we just have all that data collected it's

probably going to be a really long boring list of all those different categories so the first thing we'd like to do is organize that into what we call

a frequency table frequency is just a fancy word for counts here we list each of the categories and we simply count how

many of our lemurs fit into each of those categories now we can also take a look at what's called the relative frequency

the relative frequency is just the proportion of lemurs that fell into each category so for example we take the

number of ring tail lemurs that we have we divide by 89 and we get the proportion now keep in mind that a

relative frequency a percentage or a rate all tell us the exact same information that a proportion does

they're all really basically the same thing however we really do like using proportions I'm not trying to say that

we're never going to use frequencies at all but we like relative frequencies a lot because when we are comparing two

samples especially two samples of different sizes using relative frequencies is a much more fair way to

compare them when it comes to making graphs of categorical data we really have two options pie charts or what some

people call Circle graphs and bar graphs now a bar graph could also be turned into what's called a relative bar graph

so instead of the heights of each bar showing the frequency or the number of lemurs that fall into each category it

simply shows the proportion whereas a circle graph only shows proportions because the idea is each slice is a proportion

of the whole circle now when we look at a pie chart or a bar graph one thing that you might be asked to do is to

describe the distribution of that variable now what is a distribution because that's a really important word for this entire unit a distribution of

data is basically what values that data takes on and how often it takes on

those values so if we're asked to talk about the distribution of categorical data really all we could say is maybe which

category had the most which category had the least and maybe we could even mention all the different categories that are even available to us but

there's not a whole lot we can say oftentimes the best things we can do with either a bar graph or a pie chart

is compare two different samples so for example here we see a pie chart for the lemurs in forest 1 a pie chart for the

lemurs in forest 2 and because pie charts are based on proportions it's really easy to see some important differences

like we notice in forest 2 there's a much higher proportion of sifakas than there is in forest 1 and we simply know

that by just seeing that the piece of that pie is much bigger in forest 2 than in forest 1 now what's going to be

expected of you on the AP exam when it comes to categorical variables is really again like I said just describing the

distribution reading a bar graph and also noticing if it's a relative bar graph so you can see what proportion or what

percentage of data falls into each category now let's move on to quantitative variables which are going to take up way

more time in this video first we have two different types of quantitative variables discrete and

continuous a discrete quantitative variable takes on values that are

countable and finite for example the number of goals that you can score in a soccer game well that's going to be 0 1

2 3 4 5 you might say well I guess it can be infinite you could have a million goals in a game but again realistically

no you can't so typically with a discrete quantitative variable we're thinking whole numbers only and if we

think about it you could make a list of all possible outcomes and it wouldn't necessarily go on forever

whereas a continuous quantitative variable takes on values that are not countable and basically theoretically

could be infinite for example the weight of a frog if you think about the weight of a frog it really could go infinite in

either directions especially when you have a really good measuring tool because if you have a good measuring tool that maybe goes to say five decimal

places well even if you're talking about between 10 pounds and 11 pounds which actually would be a pretty big frog

let's shrink that down a little bit let's say between five and six pounds realistically right you have to

understand I hope you all know this between five and six pounds there's an infinite number of values right now even

if you say well we're only going to go to two decimal places well it's not an infinite number of values okay but there's still a lot and you wouldn't

want to sit and count them all but again hypothetically from five to six pounds there is an infinite number of

possibilities especially if you had some really precise measuring tool so discrete we're thinking countable set

number of outcomes that are typically whole numbers whereas continuous we got way

too many of them to even count because we got decimals upon decimals upon decimals that make for a truly

continuous variable that can take on infinite outcomes even if it's really not infinite quantitative variables can also

be organized into what we call a frequency table or a relative frequency table but because we don't have

categories or names we have numbers the first thing we have to do is create bins I mean basically intervals right so

each bin or interval has to be equal in size so here we have data from a sample

of trees and from every tree we measured the tree's height and we have bins of 20

to 30 feet 30 to 40 feet and so forth these bins are what we call left-handed

bins which means you include the number on the left and you go up to the number on the right so that first bin is for any

tree from 20 up to 29.999999 feet if a

tree had 30 feet of height it would go into the next bin so again once we set up our bins and you

can set the bins however you want you can choose whatever interval you want it just has to be consistent then you just go through your data and you count

okay how many trees were 20 to 30 feet count them up and that's again the frequency or you could obviously take

that value divided by the total of 174 total trees in the sample and you can get the relative frequency as well now

there are four types of graphs that can be made from quantitative data a Dot Plot a stem and leaf plot a histogram

and a cumulative graph now let's look at our sample of 174 trees and from every

tree we measured its height which is a quantitative variable first off because it's a number and technically it would be continuous

because the height of a tree if you got a really precise measuring tool could be any value but again you get the idea now

here is an example of a stem and leaf plot cool thing about a stem and leaf plot is you can actually see all the

individual values and they just stack up so you can see the distribution then we have a Dot Plot that puts dots for each

individual tree we could also see where they stack up we see there's far fewer trees on the left far fewer trees on the

right most trees kind of in the middle around 80 feet then what we have is called a histogram I'll probably say that a histogram is the number one

preferred graph for quantitative data in all statistics once again across the

x-axis we see those bins or intervals 20 to 30 30 to 40 40 to 50 and then we simply

count how many trees fall into each bin and then we make a bar that goes up to that count or that frequency you could

also make it a relative frequency histogram as well where that bar goes up to the proportion instead of the count

now listen I know it looks like a bar graph It might smell like a bar graph it might even taste like a bar graph but

it's not a bar graph bar graphs are for categorical data don't ever call a histogram a bar graph you'll offend a

statistician somewhere the really cool thing about it whether it's a stem and leaf plot or it's a dot plot or if it's

a histogram is that you can see the distribution remember the distribution is what values your variable can take on

and how often it takes them on so by looking at these distributions we can clearly see where there's less data

where there's more data what heights are most common versus what heights are least common now the fourth type of graph is

called a cumulative graph these are really cool graphs that you actually don't see too often but they're really

really valuable now here we see a bunch of dots connected by lines now every dot

has an X and it has a y for example there's a DOT at 80 on the X that's 80

feet and 0.45 on the Y now what that means is that 45 percent of all the

trees in our sample were below 80 feet so again every dot tells you the

proportion of data below that particular height now if we look in between we see

that the slopes of the lines connecting the dots are different a steeper slope simply means that there's more data in

that range so we see that there's a large amount of data from 60 to 70 and also a large amount from 70 to 80 so

that's where we see steeper lines if the line is horizontal like we see between 0 and 10 or 10 to 20 that means there is

no data in those bins whatsoever because there was no change from one to the other
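
Here is a minimal Python sketch of how the points on a cumulative graph could be built from the bin counts. The function name and the counts are made up for illustration and are not the tree data from the video. Each point is the proportion of data below the upper edge of its bin, so an empty bin leaves the line flat and a crowded bin makes it steep.

```python
from itertools import accumulate


def cumulative_relative_frequency(bin_counts):
    """Return the proportion of data below the upper edge of each bin."""
    total = sum(bin_counts)
    # accumulate() yields the running total of counts, bin by bin
    return [running / total for running in accumulate(bin_counts)]


# Hypothetical counts for the bins 0-10, 10-20, 20-30, 30-40, 40-50 feet
counts = [0, 0, 2, 6, 2]
points = cumulative_relative_frequency(counts)
print(points)  # [0.0, 0.0, 0.2, 0.8, 1.0]
```

The first two bins are empty, so the graph stays at 0 there. The biggest jump (0.2 to 0.8) comes from the bin with the most data, which is where the line would be steepest.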

these are great graphs as well to see some really important information about how the data builds up where there's a

lot of data where there's a little data all through this idea of looking at the steepness of the lines and understanding that each point tells you the proportion

of data below that particular height make sure that you know how to analyze these different graphs and be able to

answer questions about them for example if we look at the histogram I could say hey how many trees are greater than 70

feet how many trees are less than 70 feet how many trees are between 100 and 120 feet you got to be able to answer all those

questions it's pretty simple you've got to be able to add them up make sure you get a rough count as to how many are in

each bin but also make sure if you're looking at a histogram is it a frequency histogram where it shows how many trees

are in each bin or is it a relative frequency histogram where it shows what proportion are in each bin so it's

really important to use all those kind of facts and ideas to answer questions about these different graphs but for the most part they're pretty easy questions

in this unit one of the most important things that you're going to be asked to do is to just describe the distribution

of a quantitative variable by looking at a graph now when you do this there's four things that you have to mention the

shape the center the spread and any outliers or other unusual features now when we look

at shape there's lots of different things we could say unimodal bimodal gaps clusters symmetric skewed left skewed right

when we talk about the center you're looking for one value that you think best summarizes all the data spread is

really an analysis of how the data varies and then again outliers are data values that are very far away from all the

other values whether far to the left or far to the right let's take a look at several graphs that I've made for you

that are going to enable us to well talk about the distributions now every

single graph represents a sample of trees selected from all different parts of a forest every single sample had

roughly 174 trees and we're going to see how that sample shook out now in these first two graphs we see the shape is

symmetric but they're both symmetric in different ways now the pink graph is symmetric with most of the data in the

middle so it's going to have a smaller spread yes the overall data does go from 20 to 140 but the majority of data is

closer in the middle near the center of around 80 to 85 feet whereas the graph

on the bottom also has a center of 80 to 85 feet but that would be called bimodal because we see a big chunk of that on

the left and another big peak of data on the right now even though 80 is probably a a good center of the data it's

actually not really a good description of the data because there's actually two senders who looks like we have two clusters of data so we've got a bunch of

smaller trees subject maybe around 35 feet and a bunch of larger trees centered maybe around 120 feet this

one's going to be way more spread out it's going to vary much much more because we got so many different trees

on the left and so many different trees on the right end of the scale whereas the graph in pink has a much smaller

spread because the majority of data is all well clumped together in the middle here we see two more samples of trees

the one in purple is clearly skewed to the left where the majority of the data is on the right so the center is probably around I don't know 110 to 120

feet and on the one in blue we see it skewed to the right which gives us the center of maybe 35 to 40 feet

now they both have similar spreads but again the majority of the data in purple is at the higher end whereas the

majority of the data in the blue is at the lower end here we have two more graphs that are both symmetric but with

the biggest difference between these two graphs is how spread out they are the one in green is far less spread out than

the one in purple in green we have a center of 80 but it's all clustered together from 60 feet to 100 feet

whereas in purple we also have a center probably around 80 feet but it's very evenly spread from 20 all the way up to

140. when your data is very evenly spread like this we typically call it uniform in this last example we see a

very unusual feature of a huge gap we have a couple trees ranging from 20 to 40 feet at the bottom then we have an

enormous gap where there's no trees at all and then we have a bunch of trees from 80 all the way to 130 with a couple

even above 130 feet now here we can also say that this graph is maybe slightly skewed to the left and

again describing the center is kind of tough because you might want to jump and say something like 70 but there's not a

single tree at 70. a better Center here would be looking at maybe 110 yes there's a couple trees in the very

bottom but typically trees in this sample are about 110 feet maybe even say 115. now we don't know for sure but

we'll learn a little bit more about this in a couple moments when we talk about outliers because the trees at the bottom definitely look like they could be outliers now in

any of these graphs that we've just taken a look at we've got to make sure that we describe the distribution in

context so if you go back and pause you can read my descriptions and how I give a quick explanation of the shape the

center and the spread and if there's any unusual features in every graph it really doesn't take a whole lot to

describe a distribution but you got to make sure you mention those four key details now I got to be honest when you

just have the graph of a distribution of a quantitative variable there's really not a whole lot you could say about the

distributions you kind of have to be a little bit vague but if you actually have all the individual values there's

so much more we could do let's start off by talking about measures of center here

we're talking about the mean and the median now these are both the most famous measures of center the mean is

found simply by adding all the values together and dividing by how many you have it's a pretty simple formula but

the mean is easily influenced by outliers remember the mean is trying to balance everything out and if there's

one really really large outlier the mean is going to move up a little bit because of it to keep it balanced that one large

outlier might only be one value but it weighs just as much as a bunch of the other small values now the median is

simply the middle value no matter what if you have an odd amount of data points then there is an exact median in the

middle if you have an even number of data points in AP Statistics we just take the average of the middle two

values now there is no formula to tell you what the median is you simply have to put your data in order and find the

middle but there is one really cool thing you can do that's going to help you and that is by using

the formula n plus 1 divided by two this formula will not tell you what the

median is but it will tell you the location of the median if your data is in order
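
As a rough Python sketch of that location rule (the function names here are my own, not anything official from the course): `median_location` gives the 1-based position in the sorted data, and `median` uses that position to pick out the middle value, or to average the two values on either side of a half position.

```python
import math


def median_location(n):
    """1-based position of the median in sorted data, using (n + 1) / 2."""
    return (n + 1) / 2


def median(values):
    """Median found by sorting and then reading off the location."""
    ordered = sorted(values)
    loc = median_location(len(ordered))
    # A whole-number location points at one value; a .5 location sits
    # between two values, so floor and ceil pick the two neighbors.
    lower = ordered[math.floor(loc) - 1]
    upper = ordered[math.ceil(loc) - 1]
    return (lower + upper) / 2


print(median_location(19))   # 10.0
print(median_location(20))   # 10.5
print(median([4, 1, 3, 2]))  # 2.5
```

When the location is a whole number, floor and ceil land on the same value, so averaging them just returns that value.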

for example if you have 19 pieces of data 19 plus 1 is 20 20 divided by 2 is 10 that means that the median is the

10th value if you have 20 pieces of data 20 plus 1 is 21 divide by 2 is 10 and a

half that means that the median is located between the 10th and the 11th value so find the 10th value find the

11th value and average them together to get your median now the median is not influenced by outliers because you could

have an absolutely enormous outlier on the far left or the far right and the median doesn't care at all because he or

she is just sitting pretty right in the middle that value on the left could go as far as it wants away and it's not

going to affect the median at all but it will affect the mean now what's really important for you to know when it comes

to the mean and the median for AP Statistics is this when your data is roughly symmetric the mean and the

median will be pretty close together so even if you don't have a picture of your data and you're like I don't know what the shape is but you do have the mean and

the median and they're really really close to each other then that's telling you that your data is symmetric when you are skewed to the left the mean is

going to be smaller than the median when you're skewed to the right the mean is

going to be larger than the median we could actually see this pretty clearly in these four graphs and the top two

graphs are both symmetric albeit in different ways but because they're symmetric the mean and the median are

going to be about the same place the arrow represents the mean and the M represents the median now the official

symbol that we have for a mean of a sample is x-bar it's x with a little

bar over top of it we don't really have any official symbol for the median we just maybe use an M or write out the

word median now when data again like I already mentioned is skewed to the left like this purple graph the mean the

arrow is going to be a little bit less than the median and when your data is skewed to the right like in blue the

mean the arrow is going to be a little bit greater than the median now let's talk about why very quickly well for

example in that blue graph yes the majority of data is at the bottom to the lower values but those higher trees

because this is our tree data even though there's only a couple of them at the far right they are heavier

they're worth more right they're of bigger value to the data set and the mean has to take them into account so

even though there's only a couple of them they have more weight to them if that makes sense that's going to pull the mean higher now we also have what

are known as measures of position these are values that tell you where you are in the data now probably one of the most

famous is what's called a percentile you might hear this all the time especially working with SAT or ACT scores a

percentile or a particular value's percentile is the percentage of data at

or below that score so for example maybe you take the SAT and you find out that you scored at the 95th percentile that means

that 95 percent of other students scored at your level or below which means five

percent were above you so that tells you your position in the data is pretty good you're at the high end
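
A small Python sketch of that definition, assuming the "at or below" wording used here (the function name and the scores are hypothetical): count how many values are at or below the one you care about and turn that into a percentage.

```python
def percentile_of(value, data):
    """Percentage of data values that are at or below the given value."""
    at_or_below = sum(1 for x in data if x <= value)
    return 100 * at_or_below / len(data)


# Hypothetical scores 1 through 20, made up for illustration
scores = list(range(1, 21))
print(percentile_of(19, scores))  # 95.0
```

Here 19 of the 20 scores are at or below 19, so that score sits at the 95th percentile and only 5 percent of scores are above it.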

now we also have what's known as the first quartile the first quartile is known as the 25th percentile think of it

as the middle of the bottom half of your data 25 percent of data is below it 75 percent of data

is above it the median which we already know is the middle of our data is actually known as the 50th percentile

because 50 percent of data is below it 50 percent is above it and the third quartile also

known as Q3 is known as the 75th percentile it has 75 percent of data

below it 25 percent of data above it so these are just some important percentiles but really a percentile can be any value for

example the 42nd percentile has 42 percent of data at or below it but again

percentiles really specifically tell you where you fall in the data next up we have measures of spread there are three

measures of spread range which is simply your max minus your min now that's going to be very easily influenced by

outliers if you have an outlier in your data it's going to make your range look huge whereas realistically the overall range

of your data might not be that big because of that outlier then we have what's known as the IQR that stands for

interquartile range this is the range of the middle fifty percent of your data from Q1 to Q3 so finding it is really

easy just take the third quartile and subtract the first quartile lastly we have probably the most common and most

used and most famous measure of spread the standard deviation the standard deviation is a pretty complicated

formula which you see here but honestly you're always going to use technology to find it for the most part or you'll be

given it but what's more important is you know what the standard deviation represents it represents how far

majority of data is from the mean so if you have a very large standard deviation

that tells you typically most of your data is very far from the mean whether it's above or below if you have a very

small standard deviation that means that most of your data is very close to the mean in the middle not too far above not too

far below now could there still be some data further and further away whether it be above or below of course but again

it's speaking to where the majority of the data Falls lastly we have outliers now when you're looking at a graph you

might just kind of vaguely say ah that value looks like it could be an outlier or

specific ways to measure or determine if you have outliers in your data now there

are two of them and which one to use really depends upon what information you have if you have your quartiles then

what you could use is called the fence method so we basically find the upper fence and the lower fence the

upper fence is found by taking Q3 the third quartile and adding 1.5 times the

IQR and if any value in your data is above that number which you just calculated as

your upper fence then it is an outlier you could have one you could have none you could have five or six who knows

to find the lower fence you take q1 the first quartile subtract 1.5 times your

IQR and that gives you your lower fence any value in your data set below that

number is considered an outlier again you could have none one two or more however many you got
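
The fence method can be sketched in Python like this. One assumption to flag: the quartiles below are found as the medians of the lower and upper halves of the sorted data (leaving out the middle value when the count is odd), which is a common classroom convention, and calculators or software may use slightly different rules. The function names and the data are made up for illustration, and the sketch assumes at least two data values.

```python
from statistics import median


def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves (assumed convention)."""
    ordered = sorted(values)
    half = len(ordered) // 2  # drops the middle value when the count is odd
    return median(ordered[:half]), median(ordered[-half:])


def fence_outliers(values):
    """Values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in values if x < lower_fence or x > upper_fence]


# Hypothetical data with one very large value
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
print(quartiles(data))       # (3, 8)
print(fence_outliers(data))  # [50]
```

With Q1 = 3 and Q3 = 8 the IQR is 5, so the fences are at -4.5 and 15.5, and only the value 50 falls outside them.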

now the second way that you can determine outliers is using your mean and standard deviation now remember we

know that the majority of data is within one standard deviation of the mean because that's well what's typical so we

identify as outliers any value that is more than two standard deviations either

above or below the mean so if you take your mean and you add two standard deviations and then you take your mean

and you subtract two standard deviations you get an interval any values in your data that are outside of that interval

would be considered outliers now I'm going to be honest with you the fence method is probably the most famous

method to find outliers but the mean and standard deviation method certainly works but again it all depends what you have if

you don't know the mean and standard deviation and all you have is your quartiles then you're going to use the fence method if you have your mean and standard

deviation then you could certainly use that method as well to determine if you have any outliers in your data now that

we've very quickly gone over all the different summary statistics let's talk about how they can be transformed if

your data is transformed now there's two different ways to transform your data first we could take every single data

value that we have and we could add a value to them all we could subtract a value from them all or we can multiply all

the values now how does that impact the different summary statistics that we just learned well addition and

subtraction affect measures of center and measures of position if you add 5 to

all your values your mean is going to go up five and your median is going to go up five the third quartile is going to go up five the 25th percentile is going to

go up five the 42nd percentile is going to go up five but what will not change is measures of spread range standard

deviation and IQR they are not affected At All by adding or subtracting values to all of your data however if you

multiply all of your data by a specific value that will affect all measures of

Statistics that's going to affect measures of center so if you multiply all your data by 0.2 for example mean

median are going to be multiplied by 0.2 range IQR standard deviation they're going to be multiplied by 0.2 and same with all

your measures of position basically everything will be multiplied by 0.2 now if you're going to transform them in two

ways maybe you're going to multiply and then add just note that the multiplication affects everything

measures of center measures of spread and measures of position but the measures of spread will not have that

constant added to them now the second way we could transform data is by adding data to our

data set or taking data away now it's really important that you understand that what matters is where that value is so if you

have a data set and you add a huge enormous outlier on the far right well your median is not going to change much

at all it might move over a little bit because you are adding a new value to your data set but it's not going

to change much whereas the mean is going to definitely get bigger because of that really big outlier remember the mean has

to take every value into account if you add a value that weighs a whole lot it's going to make the mean go

higher now if you add a new value and it's just like all the other values it's kind of right in the middle then once

again your median is not going to change a whole lot and your mean's not going to change much either all right that's it for summary

statistics there's a lot going on there and a lot of new things we learned but you know feel free to take the time to

make sure you review it all and that it all makes sense to you now taken together the min Q1 the median Q3 and

the maximum are known as the five number summary and what we could do with the five number summary is create a box plot

which is a really cool graphical representation of our summary statistics now what we do is we make a box around

q1 and Q3 with the median somewhere in between there then in AP Statistics we

use what's called a modified box plot so first we identify outliers using our fence method we put asterisks at those

outliers then the whiskers of the box plot go to the next highest or lowest values that were not outliers here we

see an example of a box plot and the most important thing is that each section of that box plot represents 25

percent of our data now note that I have an outlier there on the far right and that that whisker went to the next value

in my data that was not deemed an outlier now the five number summary

breaks the data down into 25 percent chunks a wider whisker on the far right does not mean

more data it just means that that section of the data is more spread out so each chunk below Q1 in between Q1 and the

median in between the median and Q3 and from Q3 all the way to that outlier represents 25 percent of the data wider

simply means more spread out it doesn't mean more data now the cool thing is through a box plot you can also see the

shape you clearly see the shape of this data is skewed to the right because fifty percent of the data is towards the

bottom kind of clustered together and then the upper 50 percent of the data is way more spread out so you can visualize that as

a skewed right graph here we see two more box plots that are symmetric this

is going back to those pink and orange graphs that were both symmetric in different ways and now you can actually see that in these box plots the first

one is spread out with some outliers on the left and outliers on the right but

we see our whiskers are about the same size that means they have about equal spread on the left and right now the median's not right smack dab in the

middle of the box and that's okay but it's still pretty evenly balanced which represents symmetry then in the bottom

graph we see that the data is way more spread out look at that middle fifty percent in the box is way more spread

out that's because to grab the majority of the data the box has to go way to the

left and way to the right because again look at the histogram the majority of data is way to the left and way to the

right so the middle fifty percent is going to be way wider to capture that data now that we've learned all the

different summary statistics for a quantitative variable we can see how they all kind of fit together and really tell us a lot about the data and one

thing that the AP Statistics exam loves to do is give you a set of summary statistics and have you complete some

tasks with it so here we're going to take a look at another set of 174 trees where the height of each tree was

measured now across the top we see the summary statistics the mean the median the min Q1 Q3 the max and the standard deviation

and the first thing I noticed is that the mean is lower than the median so the data has a shape that is skewed left

also the median is closer to the third quartile than it is to the first quartile now what that means is that

there is more distance between the first quartile and the median which does not mean there's more data it just means that

section is more spread out we also notice that the third quartile is closer to the max than the first quartile is

to the min meaning that the distance between the first quartile and the min is extremely large which again is

showing that that side of the data the left side of the data is more spread out all signs point to the bottom 50 percent of the

data being more spread out than the top 50 percent which makes our data skewed to the left another very common question

has you analyze the standard deviation the standard deviation tells us that the majority of trees in this sample are

within 28.96 feet of the mean of 104.82 feet remember the standard deviation

tells you how far typical data is from the mean and within means plus or minus so if we take our mean and we add 28.96

we subtract 28.96 that tells us where the majority of our data Falls now that

standard deviation is kind of large to be quite honest which is again another sign that the data is fairly spread out
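A rough check of the arithmetic on these tree summary statistics, written as a small Python sketch. It assumes the quartiles, minimum and maximum from the same summary table (Q1 = 85, Q3 = 125, min = 22, max = 135) and applies the two outlier rules from earlier in the unit. The variable names are only illustrative.

```python
# Summary statistics quoted for the 174-tree example (heights in feet).
mean, sd = 104.82, 28.96
q1, q3 = 85, 125
minimum, maximum = 22, 135

# Typical range: one standard deviation either side of the mean.
typical_low, typical_high = mean - sd, mean + sd

# Fence method: 1.5 * IQR below Q1 and above Q3.
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Mean and standard deviation method: two standard deviations either side.
sd_low, sd_high = mean - 2 * sd, mean + 2 * sd

print(f"typical range: {typical_low:.2f} to {typical_high:.2f} feet")
print(f"fences: {lower_fence} and {upper_fence} feet")
print(f"mean +/- 2 SD: {sd_low:.2f} to {sd_high:.2f} feet")

# With only the min and max known, all we can say is "at least one" outlier.
print("min is a low outlier (fence method):", minimum < lower_fence)
print("max is a high outlier (fence method):", maximum > upper_fence)
```

With only summary statistics there is no way to count outliers exactly, so the sketch just tests the minimum and maximum against each cutoff.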

now they also love asking you to talk about outliers so remember we have two

different outlier formulas in red I have the fence method here we're taking the

third quartile 125 we're adding 1.5 times the IQR which is Q3 minus q1 and

we get 185. now the first thing I notice is the max is only 135 which means that

there are obviously no values bigger than 185 so there are no upper outliers now the lower fence is Q1 85 minus 1.5 times the

IQR and we get 25 here now the min is 22 which is below 25 so we for sure know

that we have at least one outlier the 22 foot tree but the idea here is without

knowing every single individual data point there could be more outliers like for example there could be a tree that's

23 or a tree that's 24 feet that would again be below 25 but we don't know all

those values we only know the min so that's why it's important to make sure we emphasize that there's at least one outlier there could be more but

without the data we don't know we can also use our mean and standard deviation formula by taking the mean and adding and

subtracting two standard deviations here we get an interval where we know a large majority of our data falls and any

values outside this would be deemed outliers now the top of that interval is 162.74 feet and again with our Max of

135 there's clearly no outliers on that top end but here we have 46.9 at the low

end which would tell us that any tree shorter than that including that min of 22 would definitely be considered an outlier now

here we see a box plot that I actually created for this data now if you're asked to make a modified box plot you do

need to show the outliers so I actually had all this data and there was one

other tree of 23 feet so that's why you see two dots there 22 and 23 and then that whisker goes to the next value it

looks to be about 30 that was not an outlier now again we see where the q1 the median Q3 and that max value fall

now sometimes on an AP exam if you don't actually have all the data and you only have your five number summary they'll ask you to

just make a regular box plot not showing any outliers because you could certainly do that with the five number summary

alone all right that's it for this example hopefully that made a lot of sense now another really important task that very

often comes up on the AP stats exam whether it be a multiple choice or an FRQ is comparing two different

distributions maybe we have two histograms two box plots or even two stem and leaf plots which

we could call a back to back stem and leaf plot but through any of these things we want to make sure we compare and when we

compare please make sure that we use comparative language like greater than less than bigger smaller higher lower all those

different things or even that they're just flat out the same now when you're comparing we want to compare the centers

we want to compare the shapes we want to compare the spreads we want to compare the presence or absence of outliers

let's take a look at an example here we see what we call parallel box plots these are two box plots that are

parallel and on the same x-axis now oftentimes what we're going to be asked to do is compare so the top is trees from the

west side of the forest and the bottom box plot is trees from the east side so what could we say about the shapes well

we'd say they're both approximately skewed to the right we see that the bottom fifty percent on both graphs is way

less spread out than the upper fifty percent on both graphs so they're both a little bit skewed to the right we also

identify that neither graph has any outliers and then we also could look at the medians the median for the East trees is

20 feet where the median for the West trees is 33 so it clearly has a higher Center

we could also look at the middle 50 percent the IQR for the top west trees

is way larger than the IQR or the middle fifty percent for the east trees so when we're looking at these

different graphs we want to talk about shape maybe it's the same in this case skewed right center the median is

higher for one than the other and spread as well being able to compare two distributions really is vital it comes

up almost every single year on the frq section of the AP exam so make sure you take your time with it use comparative

language and speak in context don't just say oh the one has a center of 33 and

the other has a center of 20. 33 what fish inches centimeters seconds no way

trees from the west are taller than trees from the east one has a center of around 33 feet one has a

center around 20 feet use things like that to make sure you speak in context especially when you're comparing two

distributions in this last section of unit one things take a pretty cool crazy twist now here's the deal some

sets of data can be modeled with what we call a density curve a density curve is

used to model a set of data to give us some insight as to what the population that the sample data came from could

possibly look like some sets of data can be described as approximately normally distributed this is the most famous type

of density curve there is the normal distribution now the normal distribution is unimodal mound shaped and symmetric

and it can be described with the parameters of the population mean and the population standard deviation so

here we see that normal curve again mound shaped and symmetric right smack dab in the middle is the mean and then

as we move to the right we go up one up two up three standard deviations and to the left down one down two down three standard deviations now

the normal model is used for continuous quantitative variables which again remember have infinite possibilities all

the way up towards positive infinity and all the way down towards negative infinity so why do we stop the normal model at three standard deviations above the

mean and three standard deviations below the mean because honestly a huge chunk of data is within three standard deviations

if it's normally distributed there's just very little data above three standard deviations or below three standard deviations for us to even worry about
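To put a number on "very little data", here is a minimal sketch using Python's standard-library `statistics.NormalDist`. It works on the standard normal curve, so the results are proportions for any normal distribution. The loop over one, two and three standard deviations is just an illustrative choice.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Proportion of a normal distribution within k standard deviations of the mean.
within = {}
for k in (1, 2, 3):
    within[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} SD: {within[k]:.4f}   outside: {1 - within[k]:.4f}")
```

The leftover beyond three standard deviations comes out to well under one percent, which is why the curve is usually drawn only that far.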

please note that not all data sets follow a normal distribution furthermore a sample might look unimodal Mound

shaped and symmetric and you might want to say that it is a normal distribution but remember only a population could

be officially modeled with a normal distribution now here's what's really cool about normal distributions is

that they're actually very predictable we know that about 68 percent of data in a population

is within 1 standard deviation of the mean 95 percent of data within a population

is within two standard deviations of the mean and 99.7 percent is within three standard

deviations of the mean that is why even though a normal distribution is continuous all the way down towards

negative infinity and up towards positive infinity we usually stop drawing it at negative three and positive

three standard deviations because it's so unlikely for data to be outside of that most data pretty much all 99.7 percent is

within three standard deviations of the mean we actually call this the empirical rule let's say that a large forest has trees

that do in fact follow a normal distribution when it comes to their heights they would have a mean of 80

feet and a standard deviation of 18 feet here is what that normal distribution would look like in this scenario now

remember tree height is a continuous quantitative variable so technically the

height of a tree can be anything as low as negative infinity or as high as positive Infinity if you want to look at it that way but when we draw the normal

model there is no reason for us to go below 26 feet or above 134 feet because that is three standard deviations above

and three standard deviations below the mean which is where 99.7 percent of trees in this forest are going to fall anyway now the formula

for standardized scores or Z scores is actually really really simple to find a z-score you simply take your individual

value in this case a tree height subtract the mean mu and divide by the standard deviation sigma I just want to make sure

I emphasize that when you're going to use a calculator to do this do the numerator first and then hit enter

divide by the standard deviation now once again a z-score measures how many standard deviations above or below

the mean a value is so z-scores can be negative z-scores can be positive but again don't forget the idea

that I said most data is within three standard deviations so getting a z-score of negative four

negative five positive seven positive 18. those are extremely crazy z-scores

because most data will fall within three standard deviations of your mean here we see a standard normal model which only

is labeled by the z-scores zero in the middle because the mean is zero standard deviations from itself then we go up

one up two up three standard deviations down one down two down three standard deviations now what's really cool about the standard

normal model and z-scores is that it allows us to compare anything so for example you

might think it'd be impossible to compare the height of a tree to the weight of a bear but if you standardize their scores getting the z-score for a

tree and the z-score for a bear putting them on to the same standard normal model then you could really figure out

oh that bear has a z-score of 1.3 where that tree only has a z-score of 0.9 clearly that bear is bigger relative to other bears than that tree is relative to other trees so even

though bears and trees are things that you wouldn't seemingly ever compare you actually can if you standardize their

scores so let's go back to our 100 foot tree question the first thing we can do is standardize the score for a 100 foot

tree so we're going to take a hundred subtract 80 divide by the standard deviation of 18 to get a z-score of 1.11 so we see that

spot indicated on our standard normal model that is where a 100 foot tree is because it's 1.11 standard deviations above the

mean so one question we could ask is What proportion of trees are below 100 feet so now that we have the z-score for

100 feet we could again use technology so here I'm showing you how to use a TI-84 calculator you're going to hit

second vars to go to normalcdf the lower value is negative 99 that's essentially acting as a negative Infinity we don't

have an infinity button on the calculator so we're just going to use an extremely low z-score and the upper

value is going to be that 1.11 again it works left to right lower left upper right so if we're looking at the shaded

region trees below 100 feet or below a z-score of 1.11 we're going to start at negative 99 way down below and go up to

1.11 and the TI-84 calculator tells us 0.867 so 86.7 percent of trees fall

into that range that are below 100 feet as long as it's a normal distribution
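The same calculation can be sketched in Python with the standard-library `statistics.NormalDist`, which is an assumption here since the video itself uses a TI-84. It standardizes the 100 foot tree and then asks for the proportion below it, with no need for a stand-in like negative 99 for negative infinity.

```python
from statistics import NormalDist

mu, sigma = 80, 18   # forest mean and standard deviation, in feet
height = 100

# z-score: subtract the mean first, then divide by the standard deviation.
z = (height - mu) / sigma

# Proportion of trees below 100 feet, assuming heights are normally distributed.
below = NormalDist(mu, sigma).cdf(height)

print(f"z-score: {z:.2f}")
print(f"proportion below {height} feet: {below:.3f}")
```

Calling `cdf` on the standard normal at the z-score gives the same proportion, which is the link between the raw heights and the standard normal model.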

now we could also use Desmos Desmos makes it pretty easy to do normal distributions with the command here it

is we're just looking at negative infinity up to a z-score of 1.11 and we also get 86.7

percent of trees being below 100 feet and here's an example of how we use one of those standard normal tables now a

lot of teachers might not even teach this anymore because it's a little bit old school but there are tables where you are actually going to look up your

z-score and inside the table it gives you the proportion that is below that from a standard normal table

so on the left side we look up the first decimal place that's the 1.1 so we have the one and then the point one and then

across the top we find that second decimal which is also a 1 in this case so we're going to go to the 0.01

column so in total that'd be 1.11 and we just cross that row and column together

and we get 0.867 once again telling us that 86.7 percent of trees

in this forest are below 100 feet we could also use this exact same procedure to find the proportion of trees that are

greater than 100 feet again it's the same z-score 1.11 when we go to our TI-84 calculator now the lower value is going

to be 1.11 and the upper value is going to be 99 we could also use Desmos or we could use a standard normal table
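As a hedged sketch of the "greater than 100 feet" version, again assuming Python's `statistics.NormalDist` rather than the calculator: the area above a value is one minus the area below it, so no upper stand-in like 99 is needed.

```python
from statistics import NormalDist

trees = NormalDist(mu=80, sigma=18)   # heights in feet

below_100 = trees.cdf(100)
above_100 = 1 - below_100             # complement of the area to the left

print(f"proportion above 100 feet: {above_100:.3f}")
```

The two proportions always add to 1, which makes a handy sanity check on any upper-tail answer.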

just be careful standard normal tables only give you the proportion below the

particular z-score that you look up so if the question is asking about being greater than 1.11 you have to first look

up the value in the Z table that represents below and then simply take one minus that proportion and then

you're going to get the opposite of it which would clearly be the proportion of trees above 100 feet or a z-score above

1.11 either way we get the same proportion for trees that are greater than 100 feet we can even find the

proportion of trees that are between 70 and 100 feet we've got to get the z-score for 70 and the z-score for 100 then

we could use normalcdf on our TI-84 calculator to look in between those two z-scores or we can even use Desmos to look

in between those two z-scores as well once again you could also use the standard normal table it just involves a little bit

more work because you have to look at the proportion of data below the higher z-score then you got to look up the

proportion of data below the smaller z-score subtract them to get the proportion in between most people don't

use standard normal tables anymore but if you're trained on how to use them it is still pretty easy the normal distribution even allows us to work

backwards to solve some really cool problems here we could be given the area or the proportion under a standard normal

curve and what we can do is use technology or a standard normal table to actually find the z-score that

represents that particular area let's look at this through an example in a forest with trees whose heights follow a

normal distribution with a mean of 80 feet and a standard deviation of 18 feet what height would mark the 80th

percentile so remember what a percentile is it's the percentage of trees below a particular value so what we're asking

here is what height of a tree would represent the position that has 80 percent of trees less than it which

would simultaneously mean 20 percent above it we could use technology or a standard normal table to get us the z-score that

represents this position with 80 percent below it here's how it works on your

TI-84 calculator you're going to use the command invNorm now in the command for invNorm you have to enter the area

the area you're looking for the area that you type in is the proportion below the area below or the area to the left

so I'm going to type in 0.8 because that's what we're trying to find the z-score that has 80 percent to the left

of it or below it and when we use the invNorm command we get that z-score of 0.842 now we could also use a

standard normal table what we have to do is actually use it in reverse so we're going to actually look inside the table and find approximately 0.80 now we don't

actually see it exactly but we see two numbers that are really close we see 0.7995 and 0.8023 to be honest you could

probably use either one of them and be okay but technically 0.7995 is closer to 80 percent or 0.8 so the

z-score that represents that 0.7995 below it is 0.84
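The reverse table lookup can be double-checked with a short sketch, assuming Python's `statistics.NormalDist`. It reproduces the two neighbouring table entries and then asks directly for the z-score with 80 percent of the area below it, which is what the invNorm command is doing.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# The two table entries on either side of 0.80.
area_at_084 = standard_normal.cdf(0.84)
area_at_085 = standard_normal.cdf(0.85)
print(f"area below z = 0.84: {area_at_084:.4f}")
print(f"area below z = 0.85: {area_at_085:.4f}")

# Working backwards: which z-score has 0.80 of the area to its left?
z_80 = standard_normal.inv_cdf(0.80)
print(f"z-score for the 80th percentile: {z_80:.3f}")
```

The exact z-score lands between the two table rows, a little closer to 0.84, which matches picking 0.7995 from the table.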

now I'm going to be a little bit more specific and use the 0.842 from my calculator now what we're going to do is

we're going to take our z-score formula we know the mean is 80 we know the standard deviation is 18 but now we know the z-score

that represents this 80th percentile is 0.842 then we can just work backwards

multiply the standard deviation over that's 18 and then add the 80. and this gives us the height of a tree that would

represent the 80th percentile in this case we've got 95.156 feet so 80 percent of trees in

the forest are below about 95 feet and 20 percent are above it another example might ask us something like what tree

height represents the top five percent of all trees in the forest so once again we got to get some technology here to

figure out what Z score represents that top five percent but keep in mind that when you're talking about the top five

percent you're simultaneously talking about the bottom 95 percent below it it's the same value so once again we could go to

our TI-84 calculator and we could type in 0.95 because that's the area to the

left or the area below which again represents 95 percent below which is the same thing as saying five percent above and we get

a z-score of 1.645 we can also use our standard normal table and look up 0.95

once again this is one of the drawbacks of using the tables is you're not going to find it precisely but we can get

pretty close here and we see about 0.9495 or 0.9505 both are equally as

close to 0.95 so that's a z-score of 1.64 or 1.65 now again I'm always going to be

using technology to be a little bit more accurate here so I'm going to go with the 1.645 as that z-score once again substituting that in

for z I know the mean of 80 I know the standard deviation of 18 multiply the standard deviation over add 80 and I get the

height of a tree that represents that 95th percentile so five percent of trees in the forest are taller than 109.61 feet

and 95 percent of trees are below that now I have to be honest with you there are so many more normal distribution

calculation problems that can be done other than the ones I just went over in fact there's a huge plethora of

different types of problems that all involve the normal distribution some are really easy like the ones we talked about in this video and some can be much

more complex so here's the deal I don't have time in this review video to go over all of them but what you can do is

visit my YouTube channel where I have a playlist for the normal distribution I have tons of videos to go over all the

different types of problems that could come up when you're doing normal distribution calculations they could be

really fun in my opinion but they could also be a little bit challenging which hopefully makes it fun but please check

out my YouTube playlist where you can learn much much more about the normal distribution and all the different calculations that can be done with it
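As a recap of the forest calculations in this section, here is a minimal sketch using Python's standard-library `statistics.NormalDist`. It assumes the same normal model with a mean of 80 feet and a standard deviation of 18 feet, and the variable names are only illustrative.

```python
from statistics import NormalDist

trees = NormalDist(mu=80, sigma=18)   # heights in feet

# Between two heights: area below the larger minus area below the smaller.
between_70_100 = trees.cdf(100) - trees.cdf(70)

# Working backwards from an area to a height.
p80_height = trees.inv_cdf(0.80)      # 80 percent of trees are shorter
top5_cutoff = trees.inv_cdf(0.95)     # top 5 percent means 95 percent below

# Same answer by hand: height = mean + z * standard deviation.
z_95 = NormalDist().inv_cdf(0.95)
by_hand = 80 + z_95 * 18

print(f"between 70 and 100 feet: {between_70_100:.3f}")
print(f"80th percentile height: {p80_height:.2f} feet")
print(f"top 5 percent cutoff: {top5_cutoff:.2f} feet")
```

The 80th percentile comes out a hair below the 95.156 feet found earlier because the z-score here is not rounded to 0.842 first.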

all right well that's a wrap on unit one exploring one variable data it's a pretty thick unit with lots of

information in it so hopefully I didn't review it too fast but please take a look at that study guide through the

ultimate review packet you also have the answer key and just doing that study guide and checking your answers is going

to really help prepare you not just for the unit 1 test you have in class but it's also going to really help prepare you for the AP stats exam in May now unit

one really kind of sets the foundation for the entire course because everything starts with analyzing data if you know

how to analyze data talk about data understand summary statistics and how everything ties together it's going to

really help prepare you for everything else in this course so best of luck hopefully you learned a lot I can't wait

to see you in the next video
