Descriptive Statistics | Inferential Statistics |
---|---|
Also known as exploratory data analysis (EDA). It is concerned only with the data at hand. | Uses the data at hand to draw conclusions about a larger population. |
I. Quantitative data: Also known as numerical data. It deals with measures or counts. For example: heights, speeds, scores in an exam.
II. Categorical data: Also known as qualitative data. It places each individual into a group or category. For example: eye color, blood type.
Quantitative variables are either discrete or continuous.
I. Discrete variables take on values with gaps between. For example: the number of heads we get on flipping a coin 10 times.
II. Continuous variables take on any value in an interval. For example: The time of day.
Shape: the extent to which the graph appears to be symmetric, mound-shaped (bell-shaped), skewed, bimodal, or uniform.
Dotplot: Involves plotting the data values, with dots, above the corresponding values on a number line.
Stemplot (or Stem-and-Leaf Plot):
Each data value has a stem and a leaf.
There are no mathematical rules for what constitutes the stem and what constitutes the leaf.
The nature of the data will suggest reasonable choices for the stem and leaves.
In the stem-and-leaf plot below, the scores on Quiz 1 (on the left) were generally higher than those on Quiz 2: the center of the Quiz 1 scores is higher than the center of the Quiz 2 scores. Both distributions are reasonably symmetric.
The spreads of the two distributions appear to be similar.
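As a rough illustration of how stems and leaves are split, here is a minimal Python sketch. It assumes non-negative whole-number scores, with the tens digit as the stem and the units digit as the leaf; the function name `stemplot` and the scores are hypothetical, not taken from the quizzes discussed above.

```python
from collections import defaultdict


def stemplot(values):
    """Return stem-and-leaf rows, using tens as stems and units as leaves."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 84 -> stem 8, leaf 4
        rows[stem].append(leaf)

    lines = []
    # Walk every stem from smallest to largest so gaps in the data stay visible.
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}")
    return lines


# Hypothetical quiz scores, for illustration only.
scores = [62, 75, 78, 81, 84, 84, 90, 97]
for line in stemplot(scores):
    print(line)
```

For other kinds of data (decimals, very large values) a different stem/leaf split would be more sensible, which is the point of the "no mathematical rules" remark above.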
Bar Charts:
Bar charts are used to illustrate categorical data.
The horizontal axis contains the categories.
The vertical axis contains the frequencies, or relative frequencies, of each category.
There is a space between the bars.
Histograms:
A histogram is used to illustrate quantitative data.
The horizontal axis contains numerical values, and the vertical axis contains the frequencies, or relative frequencies, of the values (often intervals of values).
A histogram divides the number line into intervals (bins) of equal width.
A bar is constructed on each interval, and the height of the bar is the number of cases in that interval.
By convention, a value that lies on the boundary between two intervals is included in the interval to the right. So the interval from 25 to 35 contains 25 but not 35.
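The boundary convention can be sketched in Python as follows. This is only an illustration: the function name `bin_counts`, its parameters, and the data are hypothetical, and each bin is treated as including its left edge but not its right edge.

```python
def bin_counts(values, start, width, n_bins):
    """Count values in equal-width bins [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // width)  # index of the bin containing v
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts


# Hypothetical values; the bins are [25, 35), [35, 45), [45, 55).
ages = [25, 27, 34, 35, 35, 41, 44, 45]
counts = bin_counts(ages, start=25, width=10, n_bins=3)
print(counts)  # [3, 4, 1]: 25 lands in the first bin, each 35 in the second
```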
Mean:
Let xᵢ represent any value in a set of n values (i = 1, 2, ..., n). The mean of the set is defined as the sum of the x's divided by n: x̄ = (x₁ + x₂ + ... + xₙ) / n.
Example problem:
During his major league career, Babe Ruth hit the following number of home runs (1914–1935): 0, 4, 3, 2, 11, 29, 54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22, 6. What was the mean number of home runs per year for his major league career?
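The arithmetic for this problem can be checked with a short Python sketch: add up the values and divide by how many there are. The variable names are illustrative only.

```python
home_runs = [0, 4, 3, 2, 11, 29, 54, 59, 35, 41, 46,
             25, 47, 60, 54, 46, 49, 46, 41, 34, 22, 6]

total = sum(home_runs)   # sum of the x's
n = len(home_runs)       # number of seasons
mean = total / n         # mean = sum / n

print(total, n, round(mean, 2))  # 714 22 32.45
```

So the mean is about 32.45 home runs per year.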
Median:
Example problem:
Consider once again the data in the previous example from Babe Ruth’s career. What was the median number of home runs per year he hit during his major league career?
Solution:
First, put the numbers in order from smallest to largest: 0, 2, 3, 4, 6, 11, 22, 25, 29, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60. There are 22 scores, so the median is found at the 11.5th position, between the 11th and 12th scores (35 and 41). So the median is (35+41)/2=38.
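The same steps (sort, locate the middle position, average the two middle values when n is even) can be sketched in Python. The helper name `median` is chosen here for illustration; the standard library's `statistics.median` does the same job.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


home_runs = [0, 4, 3, 2, 11, 29, 54, 59, 35, 41, 46,
             25, 47, 60, 54, 46, 49, 46, 41, 34, 22, 6]
print(median(home_runs))  # 38.0
```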
Resistant Values:
Example:
Noor and Jacob were collecting data for a statistics class. They used a precise scale borrowed from the science department and weighed 100 U.S. pennies. They did not provide the raw data, but these are summarized in a frequency table and a histogram. Estimate the median weight of their sample of pennies.
The cumulative frequency of 44 in the third row means that 44 pennies weigh less than 3.10 grams. 29 + 14 + 1 = 44. We know the median is between the 50th and 51st pennies, so the median is between 3.10 and 3.14 grams. Our estimate could be anywhere in this interval.
Variance and Standard Deviation:
Variance is the average squared deviation from the mean.
The more distant a value is from the mean, the larger will be the square of the difference between it and the mean.
However, the units for the variance won’t match the units of the original data because each difference is squared.
For example, the variance of a set of measurements made in inches will be in square inches.
To correct this, we often take the square root of the variance as our measure of spread.
The square root of the variance is known as the standard deviation.
Like the mean, the standard deviation (written s for a sample) is not resistant to extreme values.
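A minimal Python sketch of these two measures follows. It assumes the sample versions, which divide the sum of squared deviations by n − 1 rather than n (this is what calculators report as s). The data are hypothetical measurements in inches.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


# Hypothetical measurements in inches.
lengths_in = [2, 4, 4, 4, 5, 5, 7, 9]

var = sample_variance(lengths_in)  # units: square inches
sd = math.sqrt(var)                # units: inches, same as the data

print(round(var, 3), round(sd, 3))
# The standard library gives the same sample statistics.
print(round(statistics.variance(lengths_in), 3), round(statistics.stdev(lengths_in), 3))
```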
Interquartile Range:
Outliers:
• Find the IQR.
• Multiply the IQR by 1.5.
• Find Q1 − 1.5(IQR) and Q3 + 1.5(IQR).
• Any value below Q1 − 1.5(IQR) or above Q3 + 1.5(IQR) is a potential outlier.
Example:
The following data represent the amount of money, in British pounds, spent weekly on tobacco for 11 regions in Britain: 4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51, 4.56. Do any of the regions seem to be spending a lot more or less than the other regions? That is, are there any outliers in the data?
Solution:
Using a calculator, we find the following:
x̄ = 3.62
Sx = s = 0.59
Q1 = 3.2
Q3 = 4.03
Using the mean and standard deviation (flag values more than 2 standard deviations from the mean):
==Required interval = 3.62 ± 2(0.59) = (2.44, 4.80).== There are no values in the dataset less than 2.44 or greater than 4.80, so there are no outliers by this method.
Using the 1.5(IQR) rule:
Q1 − 1.5(IQR) = 3.2 − 1.5(4.03 − 3.2) = ==1.96==
Q3 + 1.5(IQR) = 4.03 + 1.5(4.03 − 3.2) = ==5.28.==
There are no values in the data less than 1.96 or greater than 5.28, thus there are no outliers by this method either.
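Both checks can be reproduced with a short Python sketch. It assumes the quartiles are found as the medians of the lower and upper halves of the sorted data (excluding the overall median when n is odd), which matches the calculator values used here; other software may use slightly different quartile conventions. The helper names `quartiles` and `outside` are illustrative.

```python
import statistics


def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves of the sorted data."""
    ordered = sorted(values)
    half = len(ordered) // 2  # the middle value is left out when n is odd
    return statistics.median(ordered[:half]), statistics.median(ordered[-half:])


def outside(values, fence):
    """Values falling below the lower fence or above the upper fence."""
    low, high = fence
    return [x for x in values if x < low or x > high]


spending = [4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51, 4.56]

# Method 1: mean plus or minus 2 standard deviations.
mean = statistics.mean(spending)
s = statistics.stdev(spending)
sd_fence = (mean - 2 * s, mean + 2 * s)

# Method 2: the 1.5(IQR) rule.
q1, q3 = quartiles(spending)
iqr = q3 - q1
iqr_fence = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print(outside(spending, sd_fence))   # []
print(outside(spending, iqr_fence))  # []
```

The fences computed from unrounded values differ slightly in later decimal places from the hand calculation, but the conclusion is the same: neither method flags an outlier.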
The five-number summary of a dataset is composed of the minimum value, the lower quartile, the median, the upper quartile, and the maximum value.
A box-and-whiskers plot is simply a graphical version of the five-number summary.
A box is drawn that contains the middle 50% of the data and “whiskers” extend from the lines at the ends of the box to the minimum and maximum values of the data.
If there are outliers, the “whiskers” extend to the last value before the outlier that is not an outlier.
The outliers themselves are marked with a special symbol, such as a point, a box, or a plus sign.
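As a sketch of how the pieces of a boxplot are obtained, the following Python computes the five-number summary and the whisker ends when an outlier is present. It assumes the median-of-halves quartile convention and the 1.5(IQR) rule; the function names and the data are hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[-half:])
    return ordered[0], q1, statistics.median(ordered), q3, ordered[-1]


def whisker_ends(values):
    """Most extreme values that are not flagged by the 1.5(IQR) rule."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in values if low <= x <= high]
    return min(inside), max(inside)


# Hypothetical data with one unusually large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 30]
print(five_number_summary(data))  # (1, 2.5, 5, 7.5, 30)
print(whisker_ends(data))         # (1, 8): the upper whisker stops at 8, and 30 is plotted separately
```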
The percentile rank of a term is the proportion of terms in the distribution that are less than that term. For example, a term that is at the 75th percentile is larger than 75% of the terms in a distribution.
Example:
For the first test of the year, Harvey got a 68. The class average (mean) was 73, and the standard deviation was 3. What was Harvey’s z-score on this test?
Solution:
z = (68 − 73)/3 = −5/3 ≈ −1.67.
Thus, Harvey was 1.67 standard deviations below the mean.
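A z-score is the number of standard deviations a value lies from the mean, so the calculation is a one-liner. A minimal Python sketch (the function name `z_score` is illustrative):

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


z = z_score(68, mean=73, sd=3)
print(round(z, 2))  # -1.67
```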
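The 68-95-99.7 rule, or the empirical rule, states that for an approximately normal distribution:
• About 68% of the terms are within one standard deviation of the mean.
• About 95% of the terms are within two standard deviations of the mean.
• About 99.7% of the terms are within three standard deviations of the mean.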
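The percentages in the empirical rule (about 68%, 95%, and 99.7% within 1, 2, and 3 standard deviations of the mean) can be checked numerically. This sketch uses `statistics.NormalDist` from the Python standard library; the helper name `within_k_sd` is illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)


def within_k_sd(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(within_k_sd(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```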
Standard Normal Distribution:
We convert the data to a set of z-scores using the formula z = (x − μ)/σ.
We use μ (the mean) and σ (the standard deviation) to standardize the data in a normal distribution, producing a standard normal distribution with mean 0 and standard deviation 1.
Although you are not required to know it, you might be interested to see that the function that defines the normal curve is:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
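For the curious, the normal density with mean μ and standard deviation σ, f(x) = e^(−(x − μ)² / (2σ²)) / (σ√(2π)), can be written directly in Python and compared against the standard library. The function name `normal_pdf` is illustrative, and the mean of 73 and standard deviation of 3 are borrowed from the earlier test-score example purely for demonstration.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    z = (x - mu) / sigma  # standardize first
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


print(round(normal_pdf(0), 4))  # 0.3989, the peak of the standard normal curve
# Agrees with the library implementation for a non-standard normal as well.
print(math.isclose(normal_pdf(68, mu=73, sigma=3), NormalDist(73, 3).pdf(68)))  # True
```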