Modeling Distributions of Data [The Practice of Statistics- Chapter 2]

Introduction

Graphical models called density curves can be helpful to describe the location of individuals within a distribution. Such models are especially helpful when data falls in a bell-shaped pattern called a normal distribution.

2.1- Describing Location in a Distribution

Measuring Position: Percentiles

One way to describe a data point’s location in a distribution is to state the percent of observations that are less than it. This value is called its percentile.

IMPORTANT: Some people define the pth percentile as the value with p percent of observations less than or equal to it.
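The two definitions can give different answers for the same data point. The sketch below illustrates both, assuming a made-up list of test scores; the function names are hypothetical and chosen only for this example.

```python
def percentile_of(value, data):
    """Percent of observations strictly less than `value`."""
    below = sum(1 for x in data if x < value)
    return 100 * below / len(data)


def percentile_of_inclusive(value, data):
    """Alternative definition: percent of observations less than or equal to `value`."""
    at_or_below = sum(1 for x in data if x <= value)
    return 100 * at_or_below / len(data)


# Hypothetical test scores for a class of 10 students (not from the textbook).
scores = [62, 70, 74, 78, 80, 80, 85, 88, 93, 97]

print(percentile_of(85, scores))            # 60.0
print(percentile_of_inclusive(85, scores))  # 70.0
```

A score of 85 is at the 60th percentile under the "less than" definition and at the 70th under the "less than or equal to" definition, so it is worth checking which one is in use.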

Cumulative Relative Frequency Graphs

The cumulative relative frequency for a class is the sum of the counts for that class and all classes with smaller values of the variable, divided by n (the total number of observations) and multiplied by 100 to convert it to a percent.

To make a cumulative relative frequency graph, we plot a point corresponding to the cumulative relative frequency in each class at the smallest value of the next class.
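As a rough sketch of that calculation, the code below builds the points for a cumulative relative frequency graph from a frequency table. The class boundaries and counts are invented for illustration.

```python
# Hypothetical frequency table: (class lower bound, class upper bound, count).
# Each class includes its lower bound and excludes its upper bound.
classes = [(40, 50, 2), (50, 60, 5), (60, 70, 8), (70, 80, 4), (80, 90, 1)]

n = sum(count for _, _, count in classes)  # total number of observations

running_total = 0
points = []  # (x position, cumulative relative frequency as a percent)
for lower, upper, count in classes:
    running_total += count
    cumulative_percent = 100 * running_total / n
    # Plot at the smallest value of the next class, which is this class's upper bound.
    points.append((upper, cumulative_percent))

for x, percent in points:
    print(x, percent)
```

The last point always reaches 100%, since every observation falls in the final class or an earlier one.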

Measuring Position: z-Scores

Converting observations from original values to standard deviation units is known as standardizing. To standardize a value, subtract the mean of the distribution then divide the difference by the standard deviation.

  • If x is an observation from a distribution that has a known mean and standard deviation, the standardized score (z-score) for x is

    z = (x - mean) / (standard deviation)

We often standardize observations to express them on a common scale.

Ex: comparing the heights of two children of different ages
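A small sketch of that comparison follows. The heights, means, and standard deviations are made-up values for illustration, not real growth-chart data.

```python
def z_score(x, mean, sd):
    """Standardize x: how many standard deviations it lies above (+) or below (-) the mean."""
    return (x - mean) / sd


# Hypothetical reference values in centimeters for two age groups.
z_younger = z_score(115, mean=110, sd=5)  # younger child, 115 cm tall
z_older = z_score(140, mean=138, sd=8)    # older child, 140 cm tall

print(z_younger)  # 1.0
print(z_older)    # 0.25
```

The older child is taller in centimeters, but the younger child is taller relative to children of the same age: one standard deviation above the mean versus a quarter of a standard deviation.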

Transforming Data

To find the standardized score (z-score) for an individual observation, the data is transformed by subtracting the mean and dividing the difference by the standard deviation. Transforming converts the observation from the original units of measurement to a standardized scale.

Effect of Adding (or Subtracting) a Constant

Adding the same positive number a to (or subtracting a from) each observation

  • adds a to (or subtracts a from) measures of center and location (mean, median, quartiles, percentiles)

  • does not change the shape of the distribution or measures of spread (range, IQR, standard deviation)

Effect of Multiplying (or Dividing) by a Constant

Multiplying (or dividing) each observation by the same positive number b

  • multiplies (divides) measures of center and location (mean, median, quartiles, percentiles) by b

  • multiplies (divides) measures of spread (range, IQR, standard deviation) by b

  • does not change the shape of the distribution
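The effects of adding a constant and of multiplying by a constant can be checked numerically. This is a minimal sketch using a small invented data set and arbitrary constants a = 10 and b = 3.

```python
from statistics import mean, median, stdev

data = [2, 4, 4, 6, 9]  # hypothetical observations
a, b = 10, 3            # arbitrary positive constants


def summary(values):
    """Return a few measures of center and spread."""
    return {
        "mean": mean(values),
        "median": median(values),
        "range": max(values) - min(values),
        "sd": stdev(values),
    }


original = summary(data)
shifted = summary([x + a for x in data])  # add a to every observation
scaled = summary([x * b for x in data])   # multiply every observation by b

print("original:", original)
print("shifted: ", shifted)  # center moves up by a; range and sd unchanged
print("scaled:  ", scaled)   # center and spread are both multiplied by b
```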

Connecting Transformations and z-Scores

Standardizing combines both kinds of transformation: subtract the mean, then divide by the standard deviation. The shape of the distribution stays the same, but the center and spread change. For any distribution of z-scores, the mean is always 0 and the standard deviation is always 1.
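To see why the mean and standard deviation of z-scores come out as 0 and 1, the sketch below standardizes a small invented data set and recomputes both summaries. Results are compared with a small tolerance because of floating-point rounding.

```python
from statistics import mean, stdev

data = [2, 4, 4, 6, 9]  # hypothetical observations

m = mean(data)
s = stdev(data)

# Standardize every observation: subtract the mean, divide by the standard deviation.
z_scores = [(x - m) / s for x in data]

z_mean = mean(z_scores)
z_sd = stdev(z_scores)

print(abs(z_mean) < 1e-9)      # True: the mean of the z-scores is 0
print(abs(z_sd - 1) < 1e-9)    # True: the standard deviation of the z-scores is 1
```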

2.2- Density Curves and Normal Distributions

Density Curves

A density curve is a curve that

  • is always on or above the horizontal axis

  • has area exactly 1 underneath it

Density curves describe the overall pattern of a distribution. The area under the curve and above any interval of values on the horizontal axis is the proportion of all observations that fall in that interval.
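The "area equals proportion" idea can be sketched numerically. The example assumes a simple triangular density curve on the interval from 0 to 2 (an invented curve, chosen because its areas are easy to check by hand) and approximates areas with the midpoint rule.

```python
def density(x):
    """Hypothetical triangular density curve on [0, 2] with its peak at x = 1."""
    if 0 <= x <= 1:
        return x
    if 1 < x <= 2:
        return 2 - x
    return 0.0


def area(f, lo, hi, steps=10_000):
    """Approximate the area under f between lo and hi using the midpoint rule."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


total = area(density, 0, 2)       # whole curve: should be very close to 1
middle = area(density, 0.5, 1.5)  # proportion of observations between 0.5 and 1.5

print(round(total, 4))   # 1.0
print(round(middle, 4))  # 0.75
```

Under this curve, about 75% of observations would fall between 0.5 and 1.5.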

A density curve is often a good description of the overall pattern of a distribution, but it does not account for outliers, which are departures from the overall pattern.

IMPORTANT: No set of real data is exactly described by a density curve. The curve is an approximation that is easy to use and accurate enough for practical use.

Describing Density Curves

Measures of center and spread apply to density curves in addition to the actual sets of data.

The median of a density curve is the “equal-areas point”: the point with half the area under the curve to its left and the other half to its right. Since density curves are idealized patterns, a symmetric density curve is exactly symmetric, so its median is exactly at the center. When the curve is skewed, the median is harder to locate by eye, and a mathematical calculation is needed to find it.

The mean of a density curve is the point at which the curve would balance if made of solid material.

The median and mean are the same for a symmetric density curve. They’re at the center of the curve. The mean of a skewed curve is pulled away from the median in the direction of the tail.
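As an illustration of the mean being pulled toward the tail, the sketch below uses an invented right-skewed density curve, f(x) = 2(1 - x) on the interval from 0 to 1. It finds the mean as the balance point (the integral of x times the density) and the median by bisection on the area to the left. The helper names are hypothetical.

```python
def density(x):
    """Hypothetical right-skewed density on [0, 1]: tallest at 0, with a tail toward 1."""
    return 2 * (1 - x) if 0 <= x <= 1 else 0.0


def integrate(f, lo, hi, steps=10_000):
    """Approximate the integral of f from lo to hi using the midpoint rule."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


# Mean: the balance point of the curve.
curve_mean = integrate(lambda x: x * density(x), 0, 1)

# Median: the equal-areas point, located by bisection.
lo, hi = 0.0, 1.0
for _ in range(50):
    mid = (lo + hi) / 2
    if integrate(density, 0, mid, steps=2_000) < 0.5:
        lo = mid  # less than half the area lies to the left, so move right
    else:
        hi = mid
curve_median = (lo + hi) / 2

print(round(curve_median, 4))  # 0.2929
print(round(curve_mean, 4))    # 0.3333
```

The mean (about 0.33) lies to the right of the median (about 0.29), in the direction of the long right tail.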

Since a density curve is an idealized description of a distribution, the notation for its mean and standard deviation differs from the notation used for the mean and standard deviation of actual data.

The mean of a density curve is written with the Greek letter mu, and its standard deviation is written with the Greek letter sigma.

Normal Distributions

One particularly important class of density curves is the normal curves. The distributions they describe are called normal distributions.

  • All normal curves have the same overall shape: symmetric, single-peaked, and bell-shaped

  • Any specific normal curve is completely described by giving its mean (mu) and standard deviation (sigma)

  • The mean is located at the center of the symmetric curve and is the same as the median

  • The standard deviation controls the spread of a normal curve

Normal distributions are important in statistics because

  1. They’re good descriptions for some distributions of real data (ex: SAT scores)

  2. Normal distributions are good approximations to the results of many kinds of chance outcomes (ex: number of heads in many tosses of a fair coin)

  3. Many statistical inference procedures are based on normal distributions

The 68-95-99.7 Rule

Also known as the “empirical rule”, the 68-95-99.7 rule is followed by all normal distributions.

In a normal distribution with mean mu and standard deviation sigma:

  • Approximately 68% of the observations fall within one standard deviation of the mean

  • Approximately 95% of the observations fall within two standard deviations of the mean

  • Approximately 99.7% of the observations fall within three standard deviations of the mean

IMPORTANT: The 68-95-99.7 rule applies only to normal distributions.
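The three percentages can be checked with Python's standard-library `statistics.NormalDist`. The mean of 100 and standard deviation of 15 below are arbitrary choices; any normal distribution gives the same proportions.

```python
from statistics import NormalDist

# Arbitrary normal distribution; the rule does not depend on mu or sigma.
dist = NormalDist(mu=100, sigma=15)


def within(k):
    """Proportion of observations within k standard deviations of the mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower


for k in (1, 2, 3):
    print(k, round(within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```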

The Standard Normal Distribution

Changing to standardized units z uses the formula

z=(x-mu)/sigma

If the variable we standardize has a normal distribution, then so does the new variable z. This new distribution is called the standard normal distribution.

  • The standard normal distribution is the normal distribution with mean 0 and standard deviation 1

  • If a variable x has any normal distribution with mean mu and standard deviation sigma, then the standardized variable z has the standard normal distribution N(0,1)

Because all normal distributions are the same when we standardize, we can find the area under any normal curve from a single table: Table A, the standard normal table.

  • Table A is a table of areas under the standard normal curve. The table entry for each value z is the area under the curve to the left of z
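A table entry of this kind can be reproduced in software. The sketch below uses `statistics.NormalDist` from the Python standard library as a stand-in for the printed table; the function name `table_a` is hypothetical, and rounding to four decimal places is an assumption about how such tables are usually printed.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def table_a(z):
    """Area under the standard normal curve to the left of z, rounded to 4 places."""
    return round(standard_normal.cdf(z), 4)


print(table_a(0.0))    # 0.5
print(table_a(1.0))    # 0.8413
print(table_a(-1.96))  # 0.025
```

Areas to the right of z follow from the total area of 1: for example, 1 - table_a(1.0) gives the area to the right of z = 1.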

Normal Distribution Calculations

We can answer a question about areas in any normal distribution by standardizing and using Table A, or by using technology.

How to Find Areas in Any Normal Distribution

  1. State the distribution and the values of interest- Draw a normal curve with the area of interest shaded and the mean, standard deviation, and boundary value(s) clearly identified

  2. Perform calculation- choose one: (i) Compute a z-score for each boundary value and use Table A or technology to find the desired area under the standard normal curve; or (ii) use the normalcdf command and label each of the inputs

  3. Answer the question
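The calculation step can be sketched in Python, with `statistics.NormalDist.cdf` playing a role similar to Table A or a calculator's normalcdf command. The distribution (mean 170, standard deviation 10) and the boundary values are invented for illustration.

```python
from statistics import NormalDist

# Step 1: state the distribution and the values of interest (hypothetical numbers).
mu, sigma = 170, 10
low, high = 160, 185  # we want the proportion of observations between these values

# Step 2, option (i): standardize each boundary, then use the standard normal curve.
z_low = (low - mu) / sigma    # -1.0
z_high = (high - mu) / sigma  # 1.5
standard = NormalDist()       # defaults to mean 0, standard deviation 1
area_via_z = standard.cdf(z_high) - standard.cdf(z_low)

# Step 2, option (ii): work directly with the original distribution.
dist = NormalDist(mu, sigma)
area_direct = dist.cdf(high) - dist.cdf(low)

# Step 3: answer the question.
print(round(area_via_z, 4))   # 0.7745
print(round(area_direct, 4))  # 0.7745
```

Both routes agree: about 77% of observations fall between 160 and 185 under these assumed values.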

How to Find Values From Areas in Any Normal Distribution

  1. State the distribution and the values of interest- Draw a normal curve with the area of interest shaded and the mean, standard deviation, and unknown boundary value clearly identified

  2. Perform calculations- choose one: (i) Use Table A or technology to find the value of z with the indicated area under the standard normal curve, then “unstandardize” to transform back to the original distribution; or (ii) Use the invnorm command and label each of the inputs

  3. Answer the question
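Working backward from an area to a value can be sketched the same way, with `statistics.NormalDist.inv_cdf` standing in for a reverse lookup in Table A or a calculator's inverse-normal command. The distribution and the target area of 0.90 are invented for illustration.

```python
from statistics import NormalDist

# Step 1: state the distribution and the area of interest (hypothetical numbers).
mu, sigma = 170, 10
area_left = 0.90  # we want the value with 90% of observations below it

# Step 2, option (i): find z for the given area, then "unstandardize".
z = NormalDist().inv_cdf(area_left)  # about 1.28
x = mu + z * sigma                   # solve z = (x - mu) / sigma for x

# Step 2, option (ii): ask the original distribution directly.
x_direct = NormalDist(mu, sigma).inv_cdf(area_left)

# Step 3: answer the question.
print(round(z, 2))         # 1.28
print(round(x, 1))         # 182.8
print(round(x_direct, 1))  # 182.8
```

Unstandardizing is just the z-score formula rearranged: x = mu + z * sigma.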

Assessing Normality

Just because a distribution looks normal doesn’t mean it is. A normal probability plot provides a good assessment of whether a data set follows a normal distribution. When you examine a normal probability plot, look for shapes that show clear departures from normality.

Interpreting Normal Probability Plots

If the points on a normal probability plot lie close to a straight line, the data are approximately normally distributed. Systematic deviations from a straight line indicate that the data are not normally distributed. Outliers appear as points that are far away from the overall pattern of the plot.

How can we determine shape from a normal probability plot?

In a right-skewed distribution, the largest observations fall distinctly to the right of a line drawn through the main body of points. Similarly, left skewness is evident when the smallest observations fall to the left of the line.
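One common way to build the points of a normal probability plot is sketched below: sort the data, assign each observation a percentile, and pair it with the z-score expected at that percentile under a normal distribution. The plotting position (i - 0.5) / n is an assumption (software packages use slightly different conventions), and the data set is invented to be right-skewed.

```python
from statistics import NormalDist


def normal_probability_points(data):
    """Return (observation, expected z-score) pairs for a normal probability plot."""
    ordered = sorted(data)
    n = len(ordered)
    standard = NormalDist()
    points = []
    for i, x in enumerate(ordered, start=1):
        # Assumed convention: the i-th smallest value sits at percentile (i - 0.5) / n.
        expected_z = standard.inv_cdf((i - 0.5) / n)
        points.append((x, expected_z))
    return points


# Hypothetical right-skewed data: most values are small, a few are much larger.
data = [1, 2, 2, 3, 3, 4, 5, 7, 12, 20]

points = normal_probability_points(data)
for x, expected_z in points:
    print(f"{x:>3}  {expected_z:6.2f}")
```

With observations on the horizontal axis and expected z-scores on the vertical axis, the largest values here (12 and 20) land well to the right of a line through the main body of points, which is the right-skew pattern described above.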
