Modeling Distributions of Data [The Practice of Statistics- Chapter 2]
Graphical models called density curves can be helpful to describe the location of individuals within a distribution. Such models are especially helpful when data falls in a bell-shaped pattern called a normal distribution.
One way to describe a data point’s location in a distribution is to give its percentile: the percent of observations that are less than it.
IMPORTANT: Some people define the pth percentile as the value with p percent of the observations less than or equal to it.
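The two definitions can give different answers for the same data point. A minimal Python sketch, assuming a hypothetical list of test scores (the function names are illustrative, not from the textbook):

```python
def percentile_less_than(data, value):
    """Percent of observations strictly less than `value`."""
    below = sum(1 for x in data if x < value)
    return 100 * below / len(data)


def percentile_at_most(data, value):
    """Alternative definition: percent of observations less than or equal to `value`."""
    at_or_below = sum(1 for x in data if x <= value)
    return 100 * at_or_below / len(data)


# Hypothetical test scores for a class of 10 students.
scores = [62, 70, 74, 78, 80, 80, 85, 88, 91, 95]

print(percentile_less_than(scores, 85))  # 6 of the 10 scores are below 85
print(percentile_at_most(scores, 85))    # 7 of the 10 scores are at or below 85
```

With small data sets the gap between the two definitions is noticeable; with large data sets it usually shrinks.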
The cumulative relative frequency for a class is the sum of the counts for that class and all classes with smaller values of the variable, divided by n and multiplied by 100 to express it as a percent.
To make a cumulative relative frequency graph, we plot a point corresponding to the cumulative relative frequency in each class at the smallest value of the next class.
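A short sketch of the calculation and of the points that would be plotted, assuming hypothetical classes and counts (each class includes its lower bound and excludes its upper bound):

```python
from itertools import accumulate

# Hypothetical classes (lower bound inclusive, upper bound exclusive) and counts.
classes = [(40, 45), (45, 50), (50, 55), (55, 60), (60, 65)]
counts = [2, 7, 13, 12, 6]

n = sum(counts)
cumulative_counts = list(accumulate(counts))
cumulative_percents = [100 * c / n for c in cumulative_counts]

# Each point is plotted at the smallest value of the next class,
# which here is the upper bound of the current class.
points = [(upper, pct) for (_, upper), pct in zip(classes, cumulative_percents)]

for x, pct in points:
    print(f"x = {x}: {pct:.1f}% of observations fall below {x}")
```

The last point always sits at 100%, since every observation falls below the upper bound of the final class.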
Converting observations from original values to standard deviation units is known as standardizing. To standardize a value, subtract the mean of the distribution then divide the difference by the standard deviation.
If x is an observation from a distribution that has a known mean and standard deviation, the standardized score (z-score) for x is
z= (x-mean)/standard deviation
We often standardize observations to express them on a common scale.
Ex: comparing the heights of two children of different ages
To find the standardized score (z-score) for an individual observation, the data is transformed by subtracting the mean and dividing the difference by the standard deviation. Transforming converts the observation from the original units of measurement to a standardized scale.
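The height comparison can be sketched as follows. All means, standard deviations, and heights below are hypothetical values chosen only for illustration:

```python
def z_score(x, mean, sd):
    """Standardize x: how many standard deviations x lies from the mean."""
    return (x - mean) / sd


# Hypothetical figures, in inches.
z_younger = z_score(x=48, mean=45, sd=2)     # younger child vs. own age group
z_older = z_score(x=58, mean=56, sd=2.5)     # older child vs. own age group

print(f"younger child: z = {z_younger:.2f}")
print(f"older child:   z = {z_older:.2f}")
```

Although the older child is taller in inches, the younger child has the larger z-score here, so the younger child is taller relative to children of the same age.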
Adding the same positive number a to (or subtracting a from) each observation
adds a to (or subtracts a from) measures of center and location (mean, median, quartiles, percentiles)
does not change the shape of the distribution or measures of spread (range, IQR, standard deviation)
Multiplying (or dividing) each observation by the same positive number b
multiplies (divides) measures of center and location (mean, median, quartiles, percentiles) by b
multiplies (divides) measures of spread (range, IQR, standard deviation) by b
does not change the shape of the distribution
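These effects can be checked numerically. The sketch below assumes a small hypothetical data set, a shift a = 10, and a scale factor b = 3; `summarize` is an illustrative helper, and the quartiles come from the default method of `statistics.quantiles`, which may differ slightly from the textbook's quartile rule:

```python
import statistics


def summarize(data):
    """Return measures of center and spread for a list of numbers."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "range": max(data) - min(data),
        "iqr": q3 - q1,
        "sd": statistics.stdev(data),
    }


data = [2, 4, 4, 5, 7, 9, 11]   # hypothetical observations
a, b = 10, 3                    # hypothetical shift and scale factor

original = summarize(data)
shifted = summarize([x + a for x in data])
scaled = summarize([x * b for x in data])

for name in original:
    print(f"{name:>6}: original={original[name]:.2f}  "
          f"plus {a}={shifted[name]:.2f}  times {b}={scaled[name]:.2f}")
```

In the output, adding a moves the mean and median by a and leaves the range, IQR, and standard deviation alone, while multiplying by b scales all five measures by b.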
When dealing with z-scores, the shape of the distribution stays the same despite the transformations. However, the center and spread do change. For a z-score distribution, the mean is always 0 and the standard deviation is always 1.
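A quick check of the mean-0, standard-deviation-1 claim, assuming a hypothetical data set and the sample standard deviation:

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 11]   # hypothetical observations

mean = statistics.mean(data)
sd = statistics.stdev(data)

# Standardize every observation with the same mean and standard deviation.
z_scores = [(x - mean) / sd for x in data]

z_mean = statistics.mean(z_scores)
z_sd = statistics.stdev(z_scores)

print(f"mean of z-scores: {z_mean:.6f}")
print(f"sd of z-scores:   {z_sd:.6f}")
```

Subtracting the mean shifts the center to 0, and dividing by the standard deviation rescales the spread to 1; neither step changes the shape.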
A density curve is a curve that
is always on or above the horizontal axis
has area exactly 1 underneath it
Density curves describe the overall pattern of a distribution. The area under the curve and above any interval of values on the horizontal axis is the proportion of all observations that fall in that interval.
A density curve is often a good description of the overall pattern of a distribution, but it does not describe outliers.
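To see "area equals proportion" in action, the sketch below assumes a simple hypothetical density curve, f(x) = 2x on the interval from 0 to 1, and approximates areas with the midpoint rule:

```python
def density(x):
    """A hypothetical density curve: f(x) = 2x on [0, 1], and 0 elsewhere."""
    return 2 * x if 0 <= x <= 1 else 0.0


def area_under(f, a, b, steps=10_000):
    """Approximate the area under f between a and b with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


total_area = area_under(density, 0, 1)      # should be 1 for any density curve
proportion = area_under(density, 0.5, 1)    # proportion of observations in [0.5, 1]

print(f"total area: {total_area:.4f}")
print(f"proportion between 0.5 and 1: {proportion:.4f}")
```

The curve stays on or above the horizontal axis and its total area is 1, so it qualifies as a density curve; the area above the interval from 0.5 to 1 is 0.75, meaning 75% of observations fall there.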
IMPORTANT: No set of real data is exactly described by a density curve. The curve is an approximation that is easy to use and accurate enough for practical use.
Measures of center and spread apply to density curves as well as to actual sets of data.
The median of a density curve is the “equal-areas point”, the point where half the area under the curve is to the left and the other half is to the right. Since density curves are idealized patterns, a symmetric density curve is exactly symmetric. Therefore, the median is exactly at the center. When the curve is skewed, the median is harder to locate by eye, and a mathematical process is needed to find it.
The mean of a density curve is the point at which the curve would balance if made of solid material.
The median and mean are the same for a symmetric density curve. They’re at the center of the curve. The mean of a skewed curve is pulled away from the median in the direction of the tail.
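One possible "mathematical process" is a numerical search for the equal-areas point. The sketch assumes a hypothetical left-skewed density, f(x) = 2x on the interval from 0 to 1, whose tail points toward 0:

```python
def density(x):
    """A hypothetical left-skewed density: f(x) = 2x on [0, 1], and 0 elsewhere."""
    return 2 * x if 0 <= x <= 1 else 0.0


def integrate(f, a, b, steps=10_000):
    """Approximate the area under f between a and b with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


def median_of_density(f, low, high, tolerance=1e-9):
    """Bisection search for the point with half the area to its left."""
    left_edge = low
    while high - low > tolerance:
        mid = (low + high) / 2
        if integrate(f, left_edge, mid) < 0.5:
            low = mid
        else:
            high = mid
    return (low + high) / 2


median = median_of_density(density, 0, 1)
# The balance point is the area-weighted average of x.
mean = integrate(lambda x: x * density(x), 0, 1)

print(f"median: {median:.4f}")
print(f"mean:   {mean:.4f}")
```

The mean comes out smaller than the median, which matches the rule: the tail of this curve is on the left, so the mean is pulled to the left of the median.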
Since the density curve is an idealized description of the distribution, its mean and standard deviation use different notation from the mean and standard deviation computed from actual data.
The notation for mean of a density curve is the Greek letter mu and the notation for standard deviation is the Greek letter sigma.
One particularly important class of density curves is the normal curves. The distributions they describe are called normal distributions.
All normal curves have the same overall shape: symmetric, single-peaked, and bell-shaped
Any specific normal curve is completely described by giving its mean (mu) and standard deviation (sigma)
The mean is located at the center of the symmetric curve and is the same as the median
The standard deviation controls the spread of a normal curve
Normal distributions are important in statistics because
They’re good descriptions for some distributions of real data (ex: SAT scores)
Normal distributions are good approximations to the results of many kinds of chance outcomes (ex: number of heads in many tosses of a fair coin)
Many statistical inference procedures are based on normal distributions
Also known as the “empirical rule”, the 68-95-99.7 rule is followed by all normal distributions.
In a normal distribution with mean mu and standard deviation sigma:
Approximately 68% of the observations fall within one standard deviation of the mean
Approximately 95% of the observations fall within two standard deviations of the mean
Approximately 99.7% of the observations fall within three standard deviations of the mean
IMPORTANT: The 68-95-99.7 rule applies only to normal distributions.
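The three percentages can be checked with `statistics.NormalDist` from the Python standard library. The mean and standard deviation below are hypothetical; any normal distribution gives the same proportions:

```python
from statistics import NormalDist

mu, sigma = 100, 15            # hypothetical mean and standard deviation
dist = NormalDist(mu, sigma)

within = {}
for k in (1, 2, 3):
    # Area between mu - k*sigma and mu + k*sigma.
    within[k] = dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)
    print(f"within {k} standard deviation(s): {100 * within[k]:.2f}%")
```

The exact areas are slightly different from 68, 95, and 99.7, which is why the rule says "approximately".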
Changing to standardized units z uses the formula
z=(x-mu)/sigma
If the variable we standardize has a normal distribution, then so does the new variable z. This new distribution is called the standard normal distribution.
The standard normal distribution is the normal distribution with mean 0 and standard deviation 1
If a variable x has any normal distribution with mean mu and standard deviation sigma, then the standardized variable z has the standard normal distribution N(0,1)
Because all normal distributions are the same when we standardize, we can find the area under any normal curve from a single table: Table A, the standard normal table.
Table A is a table of areas under the standard normal curve. The table entry for each value z is the area under the curve to the left of z
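A Table A lookup corresponds to evaluating the standard normal cumulative distribution function. The sketch below uses `statistics.NormalDist` and assumes a hypothetical N(64, 2.5) distribution to show that standardizing first gives the same area as working in the original units:

```python
from statistics import NormalDist

standard_normal = NormalDist(0, 1)

# Equivalent of reading the Table A entry for z = 1.00 (area to the left of z).
table_entry = standard_normal.cdf(1.00)
print(f"area to the left of z = 1.00: {table_entry:.4f}")

# Hypothetical normal distribution with mean 64 and standard deviation 2.5.
heights = NormalDist(mu=64, sigma=2.5)
x = 66.5
z = (x - heights.mean) / heights.stdev

area_direct = heights.cdf(x)          # area to the left of x in original units
area_via_z = standard_normal.cdf(z)   # area to the left of z after standardizing

print(f"z = {z:.2f}, direct area = {area_direct:.4f}, area via z = {area_via_z:.4f}")
```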
We can answer a question about areas in any normal distribution by standardizing and using Table A or by using technology.
State the distribution and the values of interest. Draw a normal curve with the area of interest shaded and the mean, standard deviation, and boundary value(s) clearly identified.
Perform calculations. Choose one: (i) compute a z-score for each boundary value and use Table A or technology to find the desired area under the standard normal curve; or (ii) use the normalcdf command and label each of the inputs.
Answer the question
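The calculation step can be sketched in Python as a stand-in for the calculator's normalcdf command. `normal_area` is an illustrative helper, and the N(500, 100) distribution and boundary values are hypothetical:

```python
from statistics import NormalDist


def normal_area(lower, upper, mu, sigma):
    """Area under the N(mu, sigma) curve between lower and upper."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)


# Hypothetical question: scores follow N(500, 100).
# What proportion of scores fall between 400 and 650?
mu, sigma = 500, 100
lower, upper = 400, 650

# Option (i): standardize each boundary, then use the standard normal curve.
z_lower = (lower - mu) / sigma
z_upper = (upper - mu) / sigma
area_via_z = NormalDist().cdf(z_upper) - NormalDist().cdf(z_lower)

# Option (ii): work directly in the original units.
area = normal_area(lower, upper, mu, sigma)

print(f"z-scores: {z_lower:.2f} and {z_upper:.2f}")
print(f"area via z-scores: {area_via_z:.4f}")
print(f"area directly:     {area:.4f}")
```

Both options give the same area, about 0.77, so roughly 77% of scores fall in the interval under these assumptions.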
To work backward and find a value from a given area:
State the distribution and the values of interest. Draw a normal curve with the area of interest shaded and the mean, standard deviation, and unknown boundary value clearly identified.
Perform calculations. Choose one: (i) use Table A or technology to find the value of z with the indicated area under the standard normal curve, then “unstandardize” to transform back to the original distribution; or (ii) use the invNorm command and label each of the inputs.
Answer the question
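Working backward from an area to a value can be sketched with `NormalDist.inv_cdf`, which plays the role of the calculator's invNorm command. The N(500, 100) distribution and the 90% area are hypothetical:

```python
from statistics import NormalDist

# Hypothetical question: scores follow N(500, 100).
# What score has 90% of all scores below it?
mu, sigma = 500, 100
area_to_left = 0.90

# Option (i): find z for the given area, then unstandardize with x = mu + z * sigma.
z = NormalDist().inv_cdf(area_to_left)
x = mu + z * sigma

# Option (ii): invert the cumulative area directly in the original units.
x_direct = NormalDist(mu, sigma).inv_cdf(area_to_left)

print(f"z = {z:.4f}")
print(f"x via unstandardizing: {x:.2f}")
print(f"x directly:            {x_direct:.2f}")
```

Unstandardizing simply reverses the z-score formula: multiply z by the standard deviation, then add the mean.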
Just because a distribution looks normal doesn’t mean it is. A normal probability plot provides a good assessment of whether a data set follows a normal distribution. When you examine a normal probability plot, look for shapes that show clear departures from normality.
If the points on a normal probability plot are close to a straight line, the data are approximately normally distributed. Systematic deviations from the straight line show that the data are not normally distributed. Outliers appear as points that are far away from the overall pattern of the plot.
In a right-skewed distribution, the largest observations fall distinctly to the right of a line drawn through the main body of points. Similarly, left skewness is evident when the smallest observations fall to the left of the line.
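The points of a normal probability plot can be computed by hand. This is a rough sketch under stated assumptions: the plotting positions (i + 0.5) / n are one common convention (software packages differ), the correlation is used only as a numeric stand-in for "close to a straight line", and both data sets are hypothetical, with the first placed exactly at normal quantiles as an idealized case.

```python
from statistics import NormalDist, mean


def normal_plot_points(data):
    """Pair each sorted observation with the z-score expected at its percentile."""
    ordered = sorted(data)
    n = len(ordered)
    standard_normal = NormalDist()
    return [(x, standard_normal.inv_cdf((i + 0.5) / n)) for i, x in enumerate(ordered)]


def correlation(points):
    """Correlation between observed values and expected z-scores."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Idealized data: 12 values placed exactly at the quantiles of a hypothetical N(50, 10).
normal_like = [NormalDist(50, 10).inv_cdf((i + 0.5) / 12) for i in range(12)]
# Hypothetical right-skewed data with a long tail of large values.
right_skewed = [1, 1, 2, 2, 3, 3, 4, 5, 7, 10, 15, 25]

r_normal = correlation(normal_plot_points(normal_like))
r_skewed = correlation(normal_plot_points(right_skewed))

print(f"normal-like data:  r = {r_normal:.3f}")
print(f"right-skewed data: r = {r_skewed:.3f}")
```

A correlation near 1 means the points lie close to a line; the skewed data give a visibly lower value. The number is only a summary, so the plot itself should still be examined for the patterns described above.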