what's up my stat stars welcome to the AP Statistics unit 1 summary video in this
video we're going to go all the way through unit 1 exploring one variable data talking about all the major themes
and all the major Concepts to make sure that you are ready either for your unit one test or to help prepare you for the
AP test in May now before we begin I want to mention two really really important things first this is just a
review video we're not going to cover every single teeny tiny Topic in extreme detail that's what a class was for the
purpose of this video is to take everything that your teacher threw into the last couple weeks and put it into one digestible video that kind of covers
the big major themes of it all now if you are looking for much more specific videos that cover every single topic in
unit 1 and all the other units of AP Statistics please check out my YouTube channel I got videos for every single
topic explaining everything in much more detail than in this review video or if you're really looking for a lot of great
information that can help you prepare for your unit test or the AP test please check out the Ultimate Review Packet
using the link in the description at the ultimate review packet you can get a free trial to take a look at every
single unit you get study guides practice sheets practice multiple choice and you also get these awesome review
videos as well and the best part is you even get answer keys to study guides and
those practice sheets to make sure that you're doing everything okay at the very very end you can even do a full length
practice AP exam and the second thing I want to mention is Yes you heard me right study guide while you're at the
ultimate review packet please make sure to download my study guide for unit one I also got study guides for all the
other units and you can use that study guide while you watch this video you could pause fill Parts in hit play pause
fill some more parts in or you can watch the whole video and fill it all out at the end but the best thing is you got
access to that answer key so you can check out all the answers at the end and make sure that you're doing everything okay and if you even want more practice
to prepare you for the exam you can also check out my practice sheets alright let's get into it you know
unit 1 is all about exploring one variable data
we're really going to learn how to analyze one variable or how to take one variable and compare it across multiple
samples or multiple groups now listen understanding how to analyze data is super important may seem kind of boring
and not that fun at the beginning but what we need to do with analyzing data later on in statistics is so crucial to
really the big important concepts that are probably going to be the most challenging for you so if you understand how to analyze data now it's going
to pay off big time at the end when we do some really important stuff now listen this unit is really broken down
into two things categorical data and quantitative data and I'm not going to lie to you categorical data is way
easier way faster way shorter in fact only a small percentage of this entire unit is even about categorical data much
much bigger part of the unit is over quantitative data but regardless of categorical or quantitative variables there's something really important that
you need to understand anytime you select a sample and from that sample you collect data any summary information
that you learn from that sample data is called a statistic whereas if you
collect information if you collect data from an entire population then anything you learn from that population is called
a parameter it's really easy to memorize these things because basically here's the idea statistic starts with an s and
so does sample and statistics come from samples parameter starts with a p and parameters come from populations which
also starts with a p so it's pretty easy to remember that concept now we collect data from individuals and individuals
can be well honestly anything it can be a person it could be a chair it can be a tree it can be a lake it could be a state
it could be a country it could be a day for all that matters really an individual can be anything now here's
the most important part a variable is any characteristic that can change from
one individual to another so if you just think about a person or maybe multiple people think about any characteristic
that can change from one to another eye color hair color weight height just to name a few now the reason why we like
analyzing data so much is because individuals vary if individuals didn't vary well then we wouldn't even
need this course and the world would be a pretty boring place now here's the deal with variables there's only two
types all variables in the world can be categorized into two types either categorical variables or quantitative
variables a categorical variable takes on values that are category names or
group labels like eye color or hair color whereas a quantitative variable takes on numerical values that are
either measured or counted like the weight of a frog or how many candies are in a bag to try to keep it really simple
a categorical variable value is simply going to be a word whereas a quantitative variable value is typically
going to be a number now there are a couple exceptions to that rule namely zip code ZIP code is a number but it's
not measured and it's not counted so that doesn't make it quantitative a zip code is simply a number that tells your mail
where to go which means it simply puts your mail into a specific category for your city's post office so that's why
zip code is one of those weird exceptions that's a number but technically a categorical variable but to be honest in most cases it's pretty
straightforward categorical variables are words quantitative variables are numbers let's start off with categorical
data because it really is shorter and much faster to talk about there's just not a whole lot there now let's say that
we take a sample of 89 lemurs and one of the variables that we want to analyze from those lemurs is the type of lemur
it is whether it's a sifaka an aye-aye a ring-tailed or a mouse lemur I'm probably pronouncing some of those wrong but
again those are all words which makes this a categorical variable now if we just have all that data collected it's
probably going to be a really long boring list of all those different categories so the first thing we'd like to do is organize that into what we call
a frequency table frequency is just a fancy word for counts here we list each of the categories and we simply count how
many of our lemurs fit into each of those categories now we can also take a look at what's called the relative frequency
the relative frequency is just the proportion of lemurs that fell into each category so for example we take the
number of ring tail lemurs that we have we divide by 89 and we get the proportion now keep in mind that a
relative frequency a percentage or a rate all tell us the exact same information that a proportion does
they're all really basically the same thing however we really do like using proportions I'm not trying to say that
we're never going to use frequencies at all but we like relative frequencies a lot because when we're comparing two
samples especially two samples of different sizes using relative frequencies is a much more fair way to
compare them when it comes to making graphs of categorical data we really have two options pie charts or what some
people call Circle graphs and bar graphs now a bar graph could also be turned into what's called a relative bar graph
so instead of the heights of each bar showing the frequency or the number of lemurs that fall into each category it
simply shows the proportion whereas a circle graph only shows proportions because the idea is each slice is a proportion
of the whole circle now when we look at a pie chart or a bar graph one thing that you might be asked to do is to
describe the distribution of that variable now what is a distribution because that's a really important word for this entire unit a distribution of
data is basically what values that that data takes on and how often it takes on
those values so if we're asked to talk about the distribution of categorical data really all we could say is maybe which
category had the most which category had the least and maybe we could even mention all the different categories that are even available to us but
there's not a whole lot we can say oftentimes the best things we can do with either a bar graph or a pie chart
is compare two different samples so for example here we see a pie chart for the lemurs in forest one a pie chart for the
lemurs in forest two and because pie charts are based on proportions it's really easy to see some important differences
like we notice in forest two there's a much higher proportion of sifakas than there is in forest one and we simply know
that by just seeing that the piece of that pie is much bigger in forest two than forest one now what's going to be
expected of you to answer questions about on the AP exam when it comes to categorical variables is really again like I said just describing the
distribution reading a bar graph also noticing if it's a relative bar graph so you can see what proportion or what
percentage of data falls into each category now let's move on the quantitative variables which are going to take up way
more time in this video first we have two different types of quantitative variables discrete and
continuous a discrete quantitative variable takes on values that are
countable and finite for example the number of goals that you can score in a soccer game well that's going to be 0 1
2 3 4 5 you might say well I guess it can be infinite you could have a million goals in a game but again realistically
no you can't so typically with a discrete quantitative variable we're thinking whole numbers only and if we
think about it you could make a list of all possible outcomes and it wouldn't necessarily go on forever
whereas a continuous quantitative variable takes on values that are not countable and basically theoretically
could be infinite for example the weight of a frog if you think about the weight of a frog it really could go infinite in
either directions especially when you have a really good measuring tool because if you have a good measuring tool that maybe goes to say five decimal
places well even if you're talking about between 10 pounds and 11 pounds which actually would be a pretty big frog
let's shrink that down a little bit let's say between five and six pounds realistically right you have to
understand I hope you all know this between five and six pounds there's an infinite number of values right now even
if you say well we're only going to go to two decimal places well it's not an infinite number of values okay but there's still a lot and you wouldn't
want to sit and count them all but again hypothetically from five to six pounds there is an infinite number of
possibilities especially if you had some really precise measuring tool so discrete we're thinking countable set
number of outcomes that are typically whole numbers whereas continuous we got way
too many of them to even count because we got decimals upon decimals upon decimals that make for a truly
continuous variable that can take on infinite outcomes even if it's really not infinite quantitative variables can also
be organized into what we call a frequency table or a relative frequency table but because we don't have
categories or names we have numbers the first thing we have to do is create bins I mean basically intervals right so
each bin or interval has to be equal in size so here we have data from a sample
of trees and from every tree we measured the tree's height and we have bins of 20
to 30 feet 30 to 40 feet and so forth these bins are what we call left-hand
bins which means you include the number on the left and you go up to but not including the number on the right so that first bin is for any
tree from 20 up to 29.999999 feet if a tree had exactly 30 feet of
height it would go into the next bin so again once we set up our bins and you
can set the bins however you want you can choose whatever interval you want it just has to be consistent then you just go through your data and you count
okay how many trees were 20 to 30 feet count them up and that's again the frequency or you could obviously take
that value divided by the total of 174 total trees in the sample and you can get the relative frequency as well now
there are four types of graphs that can be made from quantitative data a Dot Plot a stem and leaf plot a histogram
and a cumulative graph now let's look at our sample of 174 trees and from every
tree we measured its height which is a quantitative variable first off because it's a number and technically it would be continuous
because the height of a tree if you got a really precise measuring tool could be any value but again you get the idea now
here is an example of a stem and leaf plot cool thing about a stem and leaf plot is you can actually see all the
individual values and they just stack up so you can see the distribution then we have a Dot Plot that puts dots for each
individual tree we could also see where they stack up we see there's a far less trees on the left far less trees on the
right most trees kind of in the middle around 80 feet then what we have is called a histogram I'll probably say that a histogram is the number one
preferred graph for quantitative data in all statistics once again across the
x-axis we see those bins or intervals 20 to 30 30 to 40 40 to 50 and then we simply
count how many trees fall into each bin and then we make a bar that goes up to that count or that frequency you could
also make it a relative frequency histogram as well where that bar goes up to the proportion instead of the count
now listen I know it looks like a bar graph It might smell like a bar graph it might even taste like a bar graph but
it's not a bar graph bar graphs are for categorical data don't ever call a histogram a bar graph you'll offend a
statistician somewhere the really cool thing about it whether it's a stem and leaf plot or it's a dot plot or if it's
a histogram is that you can see the distribution remember the distribution is what values your variable can take on
and how often it takes them on so by looking at these distributions we can clearly see where there's less data
where there's more data what heights are most common versus what heights are least common now the fourth type of graph is
called a cumulative graph these are really cool graphs that you actually don't see too often but they're really
really valuable now here we see a bunch of dots connected by lines now every dot
has an X and it has a y for example there's a DOT at 80 on the X that's 80
feet and 0.45 on the Y now what that means is that 45 percent of all the
trees in our sample were below 80 feet so again every dot tells you the
proportion of data below that particular height now if we look in between we see
that the slopes of the lines connecting the dots are different a steeper slope simply means that there's more data in
that range so we see that there's a large amount of data from 60 to 70 and also a large amount from 70 to 80 so
that's where we see steeper lines if the line is horizontal like we see between 0 and 10 or 10 to 20 that means there is
no data in those bins whatsoever because there was no change from one to the other
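
As a rough sketch of where the points on a cumulative graph come from, the Python below turns bin counts into cumulative proportions. The bin counts are made-up illustrative numbers, not the 174-tree sample from the video, and `cumulative_proportions` is a hypothetical helper name.

```python
from itertools import accumulate

def cumulative_proportions(bin_edges, counts):
    """Return (upper_edge, proportion of data below that edge) pairs."""
    total = sum(counts)
    running = accumulate(counts)  # running total of the counts, bin by bin
    # bin_edges[0] is the lower edge of the first bin, so it is skipped
    return [(edge, c / total) for edge, c in zip(bin_edges[1:], running)]

# Hypothetical counts for the bins 0-10, 10-20, 20-30, 30-40, 40-50 feet
edges = [0, 10, 20, 30, 40, 50]
counts = [0, 0, 5, 10, 5]

for edge, prop in cumulative_proportions(edges, counts):
    print(f"below {edge} ft: {prop:.2f}")
```

With these assumed counts the proportion stays flat at 0 through 20 feet because those bins are empty, and it rises fastest between 30 and 40 feet, which is the bin holding the most data.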
these are great graphs as well to see some really important information about how the data builds up where there's a
lot of data where there's a little data all through this idea of looking at the steepness of the lines and understanding that each point tells you the proportion
of data below that particular height make sure that you know how to analyze these different graphs and be able to
answer questions about them for example if we look at the histogram I could say hey how many trees are greater than 70
feet how many trees are less than 70 feet how many trees are between 100 and 120 feet you got to be able to answer all those
questions it's pretty simple you're going to be able to add them up make sure you get a rough count as to how many are in
each bin but also make sure if you're looking at a histogram is it a frequency histogram where it shows how many trees
are in each bin or is it a relative frequency histogram where it shows what proportion are in each bin so it's
really important to use all those kind of facts and ideas to answer questions about these different graphs but for the most part they're pretty easy questions
in this unit one of the most important things that you're going to be asked to do is to just describe the distribution
of a quantitative variable by looking at a graph now when you do this there's four things that you have to mention the
shape the center the spread and any outliers or other unusual features now when we look
at shape there's lots of different things we could say unimodal bimodal gaps clusters symmetric skewed left skewed right
when we talk about the center you're looking for one value that you think best summarizes all the data spread is
really an analysis of how the data varies and then again outliers are data values that are very far away from all the
other values whether far to the left or far to the right let's take a look at several graphs that I've made for you
that are going to enable us to well talk about the distributions now every
single graph represents a sample of trees selected from all different parts of a forest every single sample had
roughly 174 trees and we're going to see how that sample shook out now in these first two graphs we see the shape is
symmetric but they're both symmetric in different ways now the pink graph is symmetric with most of the data in the
middle so it's going to have a smaller spread yes the overall data does go from 20 to 140 but the majority of data is
closer in the middle near the center of around 80 to 85 feet whereas the graph
on the bottom also has a center of 80 to 85 feet but that would be called bimodal because we see a big chunk of data on
the left and another big peak of data on the right now even though 80 is probably a good center of the data it's
actually not really a good description of the data because there's actually two centers it looks like we have two clusters of data so we've got a bunch of
smaller trees centered maybe around 35 feet and a bunch of larger trees centered maybe around 120 feet this
one's going to be way more spread out it's going to vary much much more because we got so many different trees
on the left and so many different trees on the right end of the scale whereas the graph in pink has a much smaller
spread because the majority of data is all well clumped together in the middle here we see two more samples of trees
the one in purple is clearly skewed to the left where the majority of the data is on the right so the center is probably around I don't know 110 to 120
feet and on the one in blue we see it skewed to the right which gives us the center of maybe 35 to 40 feet
now they both have similar spreads but again the majority of the data in purple is at the higher end whereas the
majority of the data in the blue is at the lower end here we have two more graphs that are both symmetric but
the biggest difference between these two graphs is how spread out they are the one in green is far less spread out than
the one in purple in green we have a center of 80 but it's all clustered together from 60 feet to 100 feet
whereas in purple we also have a center probably around 80 feet but it's very evenly spread from 20 all the way up to
140. when your data is very evenly spread like this we typically call it uniform in this last example we see a
very unusual feature of a huge gap we have a couple trees ranging from 20 to 40 feet at the bottom then we have an
enormous gap where there's no trees at all and then we have a bunch of trees from 80 all the way to 130 with a couple
above 130 feet now here we can also say that this graph is maybe slightly skewed to the left and
again describing the center is kind of tough because you might want to jump and say something like 70 but there's not a
single tree at 70. a better Center here would be looking at maybe 110 yes there's a couple trees in the very
bottom but typically trees in this sample are about 110 feet maybe even say 115. now we don't know for sure but
we'll learn a little bit more about this in a couple moments when we talk about outliers because the trees at the bottom definitely look like they could be outliers now in
any of these graphs that we've just taken a look at we've got to make sure that we describe the distribution in
context so if you go back and pause you can read my descriptions and how I give a quick explanation of the shape the
center and the spread and if there's any unusual features in every graph it really doesn't take a whole lot to
describe a distribution but you got to make sure you mention those four key details now I got to be honest when you
just have the graph of a distribution of a quantitative variable there's really not a whole lot you could say about the
distributions you kind of have to be a little bit vague but if you actually have all the individual values there's
so much more we could do let's start off by talking about measures of center here
we're talking about the mean and the median now these are both the most famous measures of center the mean is
found simply by adding all the values together and dividing by how many you have it's a pretty simple formula but
the mean is easily influenced by outliers remember the mean is trying to balance everything out and if there's
one really really large outlier the mean is going to move up a little bit because of it to keep it balanced that one large
outlier might only be one value but it weighs just as much as a bunch of the other small values now the median is
simply the middle value no matter what if you have an odd amount of data points then there is an exact median in the
middle if you have an even number of data points in AP Statistics we just take the average of the middle two
values now there is no formula to tell you what the median is you simply have to put your data in order and find the
middle but there is one really cool thing you can do that's going to help you and that is by using
the formula n plus 1 divided by two this formula will not tell you what the
median is but it will tell you the location of the median if your data is in order
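
A minimal Python sketch of that location idea, using made-up numbers and a hypothetical helper name `median_with_location`, assuming the data just needs to be sorted first:

```python
def median_with_location(values):
    """Return (location, median), where location is (n + 1) / 2 in the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) / 2  # 1-based position; a .5 means "between two values"
    if n % 2 == 1:
        median = ordered[int(location) - 1]
    else:
        lower = ordered[n // 2 - 1]
        upper = ordered[n // 2]
        median = (lower + upper) / 2  # average the two middle values
    return location, median

print(median_with_location([7, 1, 5, 3, 9]))      # odd count: location 3.0
print(median_with_location([7, 1, 5, 3, 9, 11]))  # even count: location 3.5
```

The location is a position in the ordered list, not the median itself, which is why the function returns both.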
for example if you have 19 pieces of data 19 plus 1 is 20 20 divided by 2 is 10 that means that the median is the
10th value if you have 20 pieces of data 20 plus 1 is 21 divide by 2 is 10 and a
half that means that the median is located between the 10th and the 11th value so find the 10th value find the
11th value and average them together to get your median now the median is not influenced by outliers because you could
have an absolutely enormous outlier on the far left or the far right and the median doesn't care at all because he or
she is just sitting pretty right in the middle that value on the left could go as far as it wants away and it's not
going to affect the median at all but it will affect the mean now what's really important for you to know when it comes
to the mean and the median for AP Statistics is this when your data is roughly symmetric the mean and the
median will be pretty close together so even if you don't have a picture of your dad and you're like I don't know what the shape is but you do have to meet in
the media and they're really really close to each other then that's telling you that your data is symmetric when you you are skewed to the left the mean is
going to be smaller than the median when you're skewed to the right the mean is
going to be larger than the median we could actually see this pretty clearly in these four graphs and the top two
graphs are both symmetric albeit in different ways but because they're symmetric the mean and the median are
going to be in about the same place the arrow represents the mean and the M represents the median now the official
symbol that we have for the mean of a sample is x-bar it's x with a little
bar over top of it we don't really have any official symbol for the median we just maybe use an M or write out the
word median now when data again like I already mentioned is skewed to the left like this purple graph the mean the
arrow is going to be a little bit less than the median and when your data is skewed to the right like in blue the
mean the arrow is going to be a little bit greater than the median now let's talk about why very quickly well for
example in that blue graph yes the majority of data is at the bottom to the lower values but those higher trees
because this is our tree data even though there's only a couple of them at that far right they are heavier
they're worth more right they're of bigger value to the data set and the mean has to take them into account so
even though there's only a couple of them they have more weight to them if that makes sense that's going to pull the mean higher now we also have what
are known as measures of position these are values that tell you where you are in the data now probably one of the most
famous is what's called a percentile you might hear this all the time especially working with SAT or ACT scores a
percentile or a particular value's percentile is the percentage of data at
or below that score so for example maybe you take the SAT and you find out that you scored in the 95th percentile that means
that 95 percent of other students scored at your level or below which means five
percent were above you so that tells you your position in the data is pretty good you're at the high end
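
That "at or below" definition can be sketched directly in Python. The scores are invented for illustration and `percentile_of` is a hypothetical helper, not a standard library function.

```python
def percentile_of(value, data):
    """Percentage of the data that is at or below `value`."""
    at_or_below = sum(1 for x in data if x <= value)
    return 100 * at_or_below / len(data)

scores = [12, 15, 18, 20, 21, 24, 27, 30, 33, 36]  # hypothetical test scores
print(percentile_of(33, scores))  # 9 of the 10 scores are at or below 33
```

Software packages sometimes use slightly different percentile conventions, so this sketch follows only the definition given here.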
now we also have what's known as the first quartile the first quartile is known as the 25th percentile think of it
as the middle of the bottom half of your data 25 percent of data is below it 75 percent of data
is above it the median which we already know is the middle of our data is actually known as the 50th percentile
because 50 percent of data is below it 50 percent is above it and the third quartile also
known as Q3 is known as the 75th percentile it has 75 percent of data
below it 25 percent of data above it so these are just some important percentiles but really a percentile can be any value for
example the 42nd percentile has 42 percent of data at or below it but again
percentiles really specifically tell you where you fall in the data next up we have measures of spread there are three
measures of spread range which is simply your max minus your min now that's going to be very easily influenced by
outliers if you have an outlier in your data it's going to make your range look huge whereas realistically the overall range
of your data might not be that big without that outlier then we have what's known as the IQR that stands for
interquartile range this is the range of the middle fifty percent of your data from q1 to Q3 so finding it is really
easy just take the third quartile and subtract the first quartile lastly we have probably the most common and most
used and most famous measure of spread the standard deviation the standard deviation has a pretty complicated
formula which you see here but honestly you're always going to use technology to find it for the most part or you'll be
given it but what's more important is you know what the standard deviation represents it represents how far the
majority of data is from the mean so if you have a very large standard deviation
that tells you typically most of your data is very far from the mean whether it's above or below if you have a very
small standard deviation that means that most of your data is very close to the mean in the middle not too far above not too
far below now could there still be some data further and further away whether it be above or below of course but again
it's speaking to where the majority of the data Falls lastly we have outliers now when you're looking at a graph you
might just kind of vaguely say ah that value looks like it could be an outlier or maybe it's not but now we have actually
specific ways to measure or determine if you have outliers in your data now there
are two of them and which one to use really depends upon what information you have if you have your quartiles then
what you could use is called the fence method so we basically find the upper fence and the lower fence the
upper fence is found by taking Q3 the third quartile and adding 1.5 times the
IQR and if any value in your data is above that number which you just calculated as
your upper fence then it is an outlier you could have one you could have none you could have five or six who knows
to find the lower fence you take q1 the first quartile subtract 1.5 times your
IQR and that gives you your lower fence any value in your data set below that
number is considered an outlier again you could have none one two or more however many you got
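
The fence method can be sketched in Python as below. The heights are made up, `fence_outliers` is a hypothetical helper, and the quartiles are passed in by hand (here taken as the medians of the lower and upper halves) because different tools compute quartiles in slightly different ways.

```python
def fence_outliers(data, q1, q3):
    """Flag values more than 1.5 * IQR below q1 or above q3."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return lower_fence, upper_fence, outliers

heights = [25, 70, 75, 80, 85, 90, 95, 100, 160]  # hypothetical heights in feet
# Assumed quartiles: median of [25, 70, 75, 80] and median of [90, 95, 100, 160]
lower, upper, flagged = fence_outliers(heights, q1=72.5, q3=97.5)
print(lower, upper, flagged)
```

With these assumed values the IQR is 25, the fences sit at 35 and 135, and the two extreme heights are flagged.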
now the second way that you can determine outliers is using your mean and standard deviation now remember we
know that the majority of data is within one standard deviation of the mean because that's well what's typical so we
identify as outliers any value that is more than two standard deviations either
above or below the mean so if you take your mean and you add two standard deviations and then you take your mean
and you subtract two standard deviations you get an interval any values in your data that's outside of that interval
would be considered outliers now I'm going to be honest with you the fence method is probably the most famous
method to find outliers but the mean and standard deviation method certainly works but again it all depends what you have if
you don't know the mean and standard deviation and all you have is your quartiles then you're going to use the fence method if you have your mean and standard
deviation then you could certainly use that method as well to determine if you have any outliers in your data now that
we've very quickly gone over all the different summary statistics let's talk about how they can be transformed if
your data is transformed now there's two different ways to transform your data first we could take every single data
value that we have and we could add a value to them all we could subtract the value to them all or we can multiply all
the values now how does that impact the different measures of Summer statistics that we just learned well addition and
subtraction affect measures of center and measures of position if you add 5 to
all your values your mean is going to go up five and your mean is going to go up five the third quartile is going to go up five the 25th percentile is going to
go up five the 42nd percentile is going to go up five but what will not change is measures of spread range standard
deviation and IQR they are not affected At All by adding or subtracting values to all of your data however if you
multiply all of your data by a specific value that will affect all of your summary
statistics that's going to affect measures of center so if you multiply all your data by 0.2 for example mean and
median are going to be multiplied by 0.2 range IQR and standard deviation are going to be multiplied by 0.2 and same with all
your measures of position basically everything will be multiplied by 0.2 now if you're going to transform them in two
ways maybe you're going to multiply and then add just note that the multiplication affects everything
measures of center measures of spread and measures of position but the measures of spread will not add whatever that
constant is now the second way we could transform data is by adding data to our
data set or taking data away now it's really important that you understand that what matters is where that value is so if you
have a data set and you add a huge enormous outlier in the far right well your median is not going to change much
at all it might move over a little bit because you are you are adding a new data to your data set but it's not going
to change much whereas the mean is going to definitely get bigger because of that really big outlier remember the mean has
to take every value into account if you add a value that weighs a whole lot it's going to make the mean go
higher now if you add a new value and it's just like all the other values it's kind of right in the middle then once
again your median is not going to change a whole lot and your mean's not going to change much either all right that's it for summary
statistics there's a lot going on there and a lot of new things we learned but you know feel free to take the time to
make sure you review it all and that it all makes sense to you now taken together the min Q1 the median Q3 and
the maximum are known as the five number summary and what we could do with the five number summary is create a box plot
which is a really cool graphical representation of our summary statistics now what we do is we make a box around
Q1 and Q3 with the median somewhere in between there then in AP Statistics we
use what's called a modified box plot so first we identify outliers using our fence method we put asterisks at those
outliers then the whiskers of the box plot go to the next highest or lowest values that were not outliers here we
see an example of a box plot and the most important thing is that each section of that box plot represents 25
percent of our data now note that I have an outlier there on the far right and that that whisker went to the next value
in my data that was not deemed an outlier now the five number summary
breaks the data down into 25 percent chunks a wider whisker on the far right does not mean
more data it just means that that section of the data is more spread out so each chunk below Q1 in between Q1 and the
median in between the median and Q3 and from Q3 all the way to that outlier represents 25 percent of the data wider
simply means more spread out it doesn't mean more data now the cool thing is through a box plot you can also see the
shape you clearly see the shape of this data is skewed to the right because fifty percent of the data is towards the
bottom kind of clustered together and then the upper 50 percent of the data is way more spread out so you can visualize that as
a skewed right graph here we see two more box plots that are symmetric this
is going back to those pink and orange graphs that were both symmetric in different ways and now you can actually see that in these box plots the first
one is spread out with some outliers on the left and outliers on the right but
we see our whiskers are about the same size that means they have about equal spread on the left and right now the median's not right smack dab in the
middle of the box and that's okay but still pretty evenly balanced which represents symmetry then the bottom
graph we see that the data is way more spread out look at that middle fifty percent in the box is way more spread
out that's because to grab the majority of the data the box has to go way to the
left and way to the right because again look at the histogram the majority of data is way to the left and way to the
right so the middle fifty percent is going to be way wider to capture that data now that we've learned all the
different summary statistics for a quantitative variable we can see how they all kind of fit together and really tell us a lot about the data and one
thing that the AP Statistics exam loves to do is give you a set of summary statistics and have you complete some
tasks with it so here we're going to take a look at another set of 174 trees where the height of each tree was
measured now across the top we see the summary statistics the mean the median the min Q1 Q3 the max and the standard deviation
and the first thing I notice is that the mean is lower than the median so the data has a shape that is skewed left
also the median is closer to the third quartile than it is to the first quartile now what that means is that
there is more distance between the first quartile and the median which does not mean there's more data it just means that
section is more spread out we also notice that the third quartile is closer to the max than the first quartile is
to the min meaning that the distance between the first quartile and the min is extremely far which again is
showing that that side of the data the left side of the data is more spread out all signs point to the bottom 50 percent of the
data being more spread out than the top 50 percent which makes our data skewed to the left another very common question
has you analyze the standard deviation the standard deviation tells us that the majority of trees in this sample are
within 28.96 feet of the mean of 104.82 feet remember the standard deviation
tells you how far typical data is from the mean and within means plus or minus so if we take our mean and we add 28.96
and we subtract 28.96 that tells us where the majority of our data falls now that
standard deviation is kind of large to be quite honest which is again another sign that the data is fairly spread out
now they also love asking you to talk about outliers so remember we have two
different outlier formulas in red I have the fence method here we're taking the
third quartile 125 we're adding 1.5 times the IQR which is Q3 minus Q1 and
we get 185. now the first thing I notice is the max is only 135 which means that
there are obviously no values bigger than 185 so there's no upper outliers now the lower fence is Q1 85 minus 1.5 times the
IQR and we get 25 here now the min is 22 which is below 25 so we for sure know
that we have at least one outlier the 22 foot tree but the idea here is without
knowing every single individual data point there could be more outliers like for example there could be a tree that's
23 or a tree that's 24 feet that would again be below 25 but we don't know all
those values we only know the min so that's why it's important to make sure we emphasize that there's at least one outlier there could be more but
without the data we don't know we can also use our mean and standard deviation formula by taking the mean and adding two and
subtracting two standard deviations here we get an interval where we know a large large majority of our data falls and any
values outside this would be deemed outliers now the top of that interval is 162.74 feet and again with our max of
135 there's clearly no outliers on that top end but here we have 46.9 on the low
end which would tell us that any tree below that including that min of 22 would definitely be considered an outlier now
here we see a box plot that I actually created for this data now if you're asked to make a modified box plot you do
need to show the outliers so I actually had all this data and there was one
other tree of 23 feet so that's why you see two dots there 22 and 23 and then that whisker goes to the next value it
looks to be about 30 that was not an outlier now again we see where the Q1 the median Q3 and that max value fall
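The two outlier checks walked through above can be reproduced with a short Python sketch. It uses only the summary statistics quoted in the video; the variable names and the has_outlier helper are illustrative assumptions, and because only the min and max are known it can confirm that at least one outlier exists, not how many.

```python
# Summary statistics quoted in the video for the 174-tree sample (feet).
q1, q3 = 85.0, 125.0
minimum, maximum = 22.0, 135.0
mean, sd = 104.82, 28.96

# Fence method: 1.5 * IQR beyond the quartiles.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Two-standard-deviation rule: mean plus or minus 2 * SD.
lower_2sd = mean - 2 * sd
upper_2sd = mean + 2 * sd


def has_outlier(lo, hi):
    """Report whether the min or max alone proves an outlier exists on each side."""
    return {"low": minimum < lo, "high": maximum > hi}


print(f"fences: {lower_fence} to {upper_fence}", has_outlier(lower_fence, upper_fence))
print(f"2 SD:   {lower_2sd:.2f} to {upper_2sd:.2f}", has_outlier(lower_2sd, upper_2sd))
```

Both rules agree here: the 22 foot minimum falls below the lower cutoff (25 for the fences, 46.90 for two standard deviations) and the 135 foot maximum stays inside the upper cutoff.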
now sometimes on an AP exam if you don't actually have all the data you only have your five number summary they'll ask you to
just make a regular box plot not showing any outliers because you could certainly do that with the five number summary
alone all right that's it for this example hopefully that made a lot of sense now another really important task that very
often comes up on the AP stats exam whether it be multiple choice or an FRQ is comparing two different
distributions maybe we have two histograms two box plots or even two stem and leaf plots which
we could call a back to back stem and leaf plot but through any of these things we want to make sure we compare and when we
compare please make sure that we use comparative language like greater than less than bigger smaller higher lower all those
different things or even that they're just flat out the same now when you're comparing we want to compare the centers
we want to compare the shapes we want to compare the spreads we want to compare the presence or absence of outliers
let's take a look at an example here we see what we call parallel box plots there are two box plots that are
parallel and on the same x-axis now oftentimes what we're going to be asked to do is compare so the top is trees from the
West side of the forest and the bottom box plot is trees from the East side so what could we say about the shapes well
we'd say they're both approximately skewed to the right we see that the bottom
less spread out than the upper fifty percent on both graphs so they're both a little bit skewed to the right we also
identify that neither graph has any outliers and then we also could look at the medians the median for the East trees is
20 feet where the median for the West trees is 33 feet so the West clearly has a higher center
we could also look at the middle 50 percent the IQR for the top West trees
is way more spread out than the IQR or the middle fifty percent for the East trees so when we're looking at these
different graphs we want to talk about shape maybe it's the same in this case you could write Center the median is
higher for one than the other and spread as well being able to compare two distributions really is vital it comes
up almost every single year on the FRQ section of the AP exam so make sure you take your time with it use comparative
language and speak in context don't just say oh the one has a center of 33 and
the other has a center of 20. 33 what fish inches centimeters seconds no way
trees from the West are a little bit taller than trees from the East one has a center of around 33 feet one has a
center around 20 feet use things like that to make sure you speak in context especially when you're comparing two
distributions in this last section of unit one things take a pretty cool crazy twist now here's the deal some
sets of data can be modeled with what we call a density curve a density curve is
used to model a set of data to give us some insight as to what the population that the sample data came from could
possibly look like some sets of data can be described as approximately normally distributed this is the most famous type
of density curve there is the normal distribution now the normal distribution is unimodal mound shaped and symmetric
and it can be described with the parameters of the population mean and the population standard deviation so
here we see that normal curve again mound shaped and symmetric right smack dab in the middle is the mean and then
as we move to the right we go up one up two up three standard deviations down one down two down three standard deviations now
the normal model is used for continuous quantitative variables which again remember have infinite possibilities all
the way up towards positive infinity and all the way down towards negative infinity so why do we stop the normal model at three standard deviations above the
mean and three standard deviations below the mean because honestly a huge chunk of data is within three standard deviations
if it's normally distributed there's just very little data above three standard deviations or below three standard deviations for us to even worry about
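The claim that very little of a normal distribution lies beyond three standard deviations can be checked numerically. The sketch below is one way to do it, assuming Python's standard library statistics.NormalDist (Python 3.8 or later); the proportion_within helper is a hypothetical name introduced only for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def proportion_within(k):
    """Proportion of a normal population within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD: {proportion_within(k):.4f}")

# Whatever is left over sits in the two tails past three standard deviations.
beyond_three = 1 - proportion_within(3)
print(f"beyond 3 SD: {beyond_three:.4f}")
```

The tails past three standard deviations together hold roughly 0.3 percent of the population, which is why the curve is usually drawn only from negative three to positive three.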
please note that not all data sets follow a normal distribution furthermore a sample might look unimodal mound
shaped and symmetric and you might want to say that it is a normal distribution but remember only a population can
be officially modeled with a normal distribution now here's what's really cool about normal distributions is
that they're actually very predictable we know that about 68 percent of data in a population
is within 1 standard deviation of the mean 95 percent of data within a population
is within two standard deviations of the mean and 99.7 percent is within three standard
deviations of the mean that is why even though a normal distribution is continuous all the way down towards
negative infinity and up towards positive infinity we usually stop drawing it at negative three and positive
three standard deviations because it's so unlikely for data to be outside of that most data pretty much all 99.7 percent is
within three standard deviations of the mean we actually call this the empirical rule let's say that a large forest has trees
that do in fact follow a normal distribution when it comes to their heights they would have a mean of 80
feet and a standard deviation of 18 feet here is what that normal distribution would look like in this scenario now
remember tree height is a continuous quantitative variable so technically the
height of a tree can be anything as low as negative infinity or as high as positive infinity if you want to look at it that way but when we draw the normal
model there is no reason for us to go below 26 feet or above 134 feet because that is three standard deviations above
and three standard deviations below the mean which is where 99.7 percent of trees in this forest are going to fall anyway now the formula
for standardized scores or z-scores is actually really really simple to find a z-score you simply take your individual
value in this case a tree height subtract the mean mu and divide by the standard deviation sigma I just want to make sure
I emphasize that when you're going to use a calculator to do this do the numerator first and then hit enter
then divide by the standard deviation now once again a z-score measures how many standard deviations above or below
the mean you could be so z-scores can be negative z-scores could be positive but again don't go back and forget the idea
that I said most data is within three so getting a z-score of negative four
negative five positive seven positive 18. those are extremely crazy z-scores
because most data will fall within three standard deviations of your mean here we see a standard normal model which only
is labeled by the z-scores zero in the middle because the mean is zero standard deviations from itself then we go up
one up two up three standard deviations down one down two down three standard deviations now what's really cool about the standard
normal model and z-scores is that it allows us to compare anything so for example you
might think it'd be impossible to compare the height of a tree to the weight of a bear but if you standardize their scores getting the z-score for a
tree and the z-score for a bear putting them on to the same standard normal model then you could really figure out
oh that bear has a z-score of 1.3 where that tree only has a z-score of 0.9 clearly that is a relatively bigger bear so even
though bears and trees are things that you wouldn't seemingly ever compare you actually can if you standardize their
scores so let's go back to our 100 foot tree question the first thing we can do is standardize the score for a 100 foot
tree so we're going to take a hundred subtract 80 divide by the standard deviation of 18 to get a z-score of 1.11 so we see that
spot indicated on our standard normal model that is where a 100 foot tree is because it's 1.11 standard deviations above the
mean so one question we could ask is what proportion of trees are below 100 feet so now that we have the z-score for
100 feet we could again use technology so here I'm showing you how to use a TI-84 calculator you're going to hit
second vars to go to normalcdf the lower value is negative 99 that's essentially acting as negative infinity we don't
have an infinity button on the calculator so we're just going to use an extremely low z-score and the upper
value is going to be that 1.11 again it works left to right lower left upper right so if we're looking at the shaded
region trees below 100 feet or below a z-score of 1.11 we're going to start at negative 99 way down below and go up to
1.11 and the TI-84 calculator tells us 0.867 so 86.7 percent of trees fall
into that range that are below 100 feet as long as it's a normal distribution
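For readers without a TI-84, the same z-score and area calculation can be sketched in Python. This is a minimal example assuming the standard library's statistics.NormalDist in place of the calculator's normalcdf; the mean of 80 and standard deviation of 18 come from the video, and the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 80.0, 18.0       # forest mean and standard deviation from the video (feet)
height = 100.0

z = (height - mu) / sigma    # standardized score: numerator first, then divide
below = NormalDist().cdf(z)  # area to the left of z on the standard normal curve

print(f"z = {z:.2f}")
print(f"proportion below {height:.0f} ft = {below:.3f}")
```

No stand-in for negative infinity is needed here, because cdf already returns the whole area to the left of z.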
now we could also use Desmos Desmos makes it pretty easy to do normal distributions with the command here it
is we're just looking at negative infinity to a z-score at the top of 1.11 and we also get 86.7
percent of trees being below 100 feet and here's an example of how we use one of those standard normal tables now a
lot of teachers might not even teach this anymore because it's a little bit old school but there are tables where you are actually going to look up your
z-score and inside the table it gives you the proportion that is below that z-score
so on the left side we look up the first decimal place that's the 1.1 so we have the one and then the point one and then
across the top we find that second decimal which is also a 1 in this case we're going to go to the 0.01
column so in total that'd be 1.11 and we just cross that row and column together
and we get 0.867 once again telling us that 86.7 percent of trees
in this forest are below 100 feet we could also use this exact same procedure to find the proportion of trees that are
greater than 100 feet again it's the same z-score 1.11 when we go to our TI-84 calculator now the lower value is going
to be 1.11 and the upper value is going to be 99 we could also use Desmos or we could use a standard normal table
just be careful standard normal tables only give you the proportion below the
particular z-score that you look up so if the question is asking about going greater than 1.11 you have to first look
up the value in the Z table that represents below and then simply take one minus that proportion and then
you're going to get the opposite of it which would clearly be the proportion of trees above 100 feet or a z-score above
1.11 either way we get the same proportion for trees that are greater than 100 feet we can even find the
proportion of trees that are between 70 and 100 feet we got to get the z-score for both 70 and the z-score for 100 then
we could use normalcdf on our TI-84 calculator to look in between those two z-scores or we can even use Desmos to look
in between those two z-scores as well once again you could also use the standard normal table just involves a little bit
more work because you have to look at the proportion of data below the higher z-score then you got to look up the
proportion of data below the smaller z-score subtract them to get the proportion in between most people don't
use standard normal tables anymore but if you're trained on how to use them it is still pretty easy the normal distribution even allows us to work
backwards to solve some really cool problems here we could be given the area or the proportion under a standard normal
curve and what we can do is use technology or a standard normal table to actually find the z-score that
represents that particular area let's look at this through an example in a forest with trees whose heights follow a
normal distribution with a mean of 80 feet and a standard deviation of 18 feet what height would mark the 80th
percentile so remember what a percentile is it's the percentage of trees at or below a particular value so what we're asking
here is what height of a tree would represent the position that has 80 percent of trees less than it which
would simultaneously mean 20 percent above it we could use technology or a standard normal table to get us the z-score that
represents this position with 80 percent below it here's how it works on your
TI-84 calculator you're going to use the command invNorm now in the command for invNorm you have to enter the area
the area you're looking for the area that you type in is the proportion below the area below or the area to the left
so I'm going to type in 0.8 because that's what we're trying to find the z-score that has 80 percent to the left
of it or below it and when we use the invNorm command we get that z-score of 0.842 now we could also use a
standard normal table what we have to do is actually use it in reverse so we're going to actually look inside the table and find approximately 0.80 now we don't
actually see it exactly but we see two numbers that are really close we see 0.7995 and 0.8023 to be honest you could
probably use either one of them and be okay but technically 0.7995 is really really close to 80 percent or 0.8 so the
z-score that represents that 0.7995 below it is 0.84
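Working backwards from an area to a z-score can also be sketched in Python. The example below assumes the standard library's statistics.NormalDist, whose inv_cdf method plays the role that invNorm plays on the calculator; it also runs the forward check on the two table entries mentioned above.

```python
from statistics import NormalDist

standard_normal = NormalDist()

# Working backwards: area to the left -> z-score.
z_80 = standard_normal.inv_cdf(0.80)
print(f"z for the 80th percentile = {z_80:.3f}")

# Forward check of the two nearest table entries mentioned in the video.
for z in (0.84, 0.85):
    print(f"area below z = {z}: {standard_normal.cdf(z):.4f}")
```

The table can only offer z = 0.84 or z = 0.85, while the inverse calculation lands between them, which is the small gain in precision that technology gives.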
now I'm going to be a little bit more specific and use the 0.842 from my calculator now what we're going to do is
we're going to take our z-score formula we know the mean is 80. we know the standard deviation is 18 but now we know the z-score
that represents this 80th percentile is 0.842 then we can just work backwards
multiply the standard deviation over that's 18 and then add the 80. and this gives us the height of a tree that would
represent the 80th percentile in this case we've got 95.156 feet so 80 percent of trees in
the forest are below about 95 feet and 20 percent are above it another example might ask us something like what tree
height represents the top five percent of all trees in the forest so once again we got to get some technology here to
figure out what z-score represents that top five percent but keep in mind that when you're talking about the top five
percent you're simultaneously talking about the bottom 95 percent below it it's the same value so once again we could go to
our TI-84 calculator and we could type in 0.95 because that's the area to the
left or the area below which again represents 95 percent below which is the same thing as saying five percent above and we get
a z-score of 1.645 we can also use a standard normal table and look up 0.95
once again this is one of the drawbacks of using the tables is you're not going to find it precisely but we can get
pretty close here and we see about 0.9495 or 0.9505 both are equally as
close to 0.95 so that's a z-score of 1.64 or 1.65 now again I'm always going to be
trying to use technology to be a little bit more accurate here so I'm going to go with the 1.645 as that z-score once again substituting that in
for z I know the mean of 80 I know the standard deviation of 18 multiply the standard deviation over add 80 and I get the
height of a tree that represents that 95th percentile so five percent of trees in the forest are taller than 109.61 feet
and 95 percent of trees are below that now I have to be honest with you there are so many more normal distribution
calculation problems that can be done other than the ones I just went over in fact there's a huge plethora of
different types of problems that all involve the normal distribution some are really easy like the ones we talked about in this video and some can be much
more complex so here's the deal I don't have time in this review video to go over all of them but what you can do is
visit my YouTube channel where I have a playlist for the normal distribution I have tons of videos to go over all the
different types of problems that could come up when you're doing normal distribution calculations they could be
really fun in my opinion but they could also be a little bit challenging which hopefully makes it fun but please check
out my YouTube playlist where you can learn much much more about the normal distribution and all the different calculations that can be done with it
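As a recap of the normal calculations covered in this section, here is a small Python sketch. It assumes the standard library's statistics.NormalDist and models tree height directly in feet instead of converting to z-scores first; the mean of 80 and standard deviation of 18 come from the video, and the variable names are illustrative.

```python
from statistics import NormalDist

trees = NormalDist(mu=80.0, sigma=18.0)  # tree-height model from the video (feet)

above_100 = 1 - trees.cdf(100)                   # complement of the area below
between_70_100 = trees.cdf(100) - trees.cdf(70)  # area below upper minus area below lower

p80_height = trees.inv_cdf(0.80)  # 80 percent of trees are shorter than this
p95_height = trees.inv_cdf(0.95)  # cutoff for the tallest 5 percent

print(f"above 100 ft: {above_100:.3f}")
print(f"between 70 and 100 ft: {between_70_100:.3f}")
print(f"80th percentile: {p80_height:.2f} ft")
print(f"95th percentile: {p95_height:.2f} ft")
```

The 80th percentile height comes out a hair below the 95.156 feet quoted in the video, because the video rounds the z-score to 0.842 before multiplying by 18.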
all right well that's a wrap on unit one exploring one variable data it's a pretty thick unit with lots of
information in it so hopefully I didn't review it too fast but please take a look at that study guide through the
ultimate review packet you also have the answer key and just doing that study guide and checking your answers is going
to really help prepare you not just for the unit 1 test you have in class but it's also going to really help prepare you for the AP stats exam in May now unit
one really kind of sets the foundation for the entire course because everything starts with analyzing data if you know
how to analyze data talk about data understand summary statistics and how everything ties together it's going to
really help prepare you for everything else in this course so best of luck hopefully you learned a lot I can't wait
to see you in the next video