Exploring Data [The Practice of Statistics - Chapter 1]
Statistics is the science of data.
To hear what data is saying, we need to help it speak by organizing, displaying, summarizing, and asking questions. That’s data analysis.
Any set of data contains information about some group of individuals. The characteristics we measure on each individual are called variables.
Individuals are the objects described by a set of data. Individuals may be people, animals, or things.
A variable is any characteristic of an individual. A variable can take different values for different individuals
Ex: In a database of all students attending a high school, the students are the individuals. The data contains values of variables such as age, gender, GPA, homeroom, and grade level.
Some variables, like gender and grade level, assign labels to individuals that place them in categories. Others, like age and GPA, take numerical values that we can do math with.
A categorical variable places an individual into one of several groups or categories
A quantitative variable takes numerical values for which it makes sense to find an average
IMPORTANT: Not every variable that takes number values is quantitative (ex: zip codes)
In most data tables, each row is an individual and each column is a variable. Categorical variables sometimes have similar counts in each category and sometimes don’t. Quantitative variables may take values that are very close together or values that are spread out. This pattern of variation of a variable is its distribution.
The distribution of a variable tells us what values the variable takes and how often it takes these values.
Inference is the idea of drawing conclusions that go beyond the data at hand.
Our ability to do inference is determined by how the data are produced.
The logic of inference rests on asking, “What are the chances?”
The values of a categorical variable are labels for the categories, such as “male” and “female”. The distribution of a categorical variable lists the categories and gives either the count or the percent of individuals who fall within each category.
Frequency tables display the counts (frequencies) of the variable in each category
Relative frequency tables show the percents (relative frequencies) of the variables in each category
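As a minimal sketch of the two kinds of table, the Python below counts a small made-up list of grade levels. The data and the names `grade_levels`, `frequency`, and `relative_frequency` are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical grade levels for ten students (made-up data for illustration).
grade_levels = ["9", "10", "10", "11", "12", "9", "10", "11", "10", "12"]

# Frequency table: the count of individuals in each category.
frequency = Counter(grade_levels)

# Relative frequency table: the percent of individuals in each category.
total = sum(frequency.values())
relative_frequency = {cat: 100 * count / total for cat, count in frequency.items()}

for cat in sorted(frequency, key=int):
    print(f"grade {cat}: count = {frequency[cat]}, percent = {relative_frequency[cat]:.0f}%")
```

The percents in a relative frequency table should add up to 100 (apart from rounding), which is a quick check on the arithmetic.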
Columns of numbers take time to read. A pie chart or a bar graph displays the distribution of a categorical variable more clearly.
Pie charts show the distribution of a categorical variable as a “pie” whose slices are sized by the counts or percents for the categories
Bar graphs represent each category as a bar. The bar heights show the category counts or percents
When you draw a bar graph, make the bars equally wide. Our eyes respond to the area of the bars, so equal widths ensure that the bars are compared by height alone.
Another issue to pay attention to is the scale. Starting the vertical axis at a number other than 0 can exaggerate the differences between categories.
To best grasp the information in a two-way table, which describes two categorical variables at once, start by looking at the distribution of each variable separately. This is its marginal distribution.
The marginal distribution of one of the categorical variables in a two-way table of counts is the distribution of values of that variable among all individuals described by the table
Percents are often more informative than counts, especially when we are comparing groups of different sizes.
Marginal distributions tell us nothing about the relationship between two variables. To analyze the relationship between the variables, we must use a conditional distribution.
A conditional distribution of a variable describes the values of that variable among individuals who have a specific value of another variable. There is a separate conditional distribution for each value of the other variable
There are two sets of conditional distributions for any two-way table: one for the column variable and one for the row variable.
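The difference between a marginal and a conditional distribution can be sketched with a small two-way table. The counts below are invented for illustration, and the names `table`, `marginal_sport`, and `conditional_sport` are assumptions, not part of the chapter. Only the conditional distributions of the column variable are computed here; the row variable's set would be found the same way with the roles swapped.

```python
# Hypothetical two-way table of counts (made-up numbers):
# rows are grade levels, columns are answers to "Do you play a sport?"
table = {
    "freshman": {"yes": 30, "no": 20},
    "senior": {"yes": 10, "no": 40},
}

total = sum(sum(row.values()) for row in table.values())

# Marginal distribution of the column variable: percents among ALL individuals.
marginal_sport = {
    answer: 100 * sum(row[answer] for row in table.values()) / total
    for answer in ("yes", "no")
}

# Conditional distributions of the column variable, one per row value:
# percents among only the individuals in that row.
conditional_sport = {
    grade: {answer: 100 * count / sum(row.values()) for answer, count in row.items()}
    for grade, row in table.items()
}

print(marginal_sport)
print(conditional_sport)
```

In this made-up table the two conditional distributions differ noticeably (60% "yes" among freshmen versus 20% among seniors), which is the kind of evidence that suggests an association.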
We could also use a segmented bar graph or a side-by-side bar graph to compare conditional distributions. Both graphs can be useful and can provide evidence of association between variables.
We say that there’s an association between two variables if knowing the value of one variable helps predict the value of the other
IMPORTANT: Even a strong association between two categorical variables can be influenced by other variables lurking in the background.
One of the simplest graphs to construct and interpret is a dotplot where each data value is shown as a dot above its location on a number line.
In any graph, look for the overall pattern and for striking departures from that pattern.
You can describe the overall pattern of a distribution by its shape, center, and spread.
An important kind of departure is an outlier, an individual value that falls outside the overall pattern.
SOCS (Shape, Outliers, Center, Spread) is the acronym for remembering how to describe the pattern
When describing the shape of a distribution, focus on the main features such as major peaks, clusters and gaps, potential outliers, and rough symmetry and clear skewness.
A distribution is roughly symmetric if the right and left sides of the graph are approximately mirror images of each other
A distribution is skewed to the right if the right side of the graph is much longer than the left side. It’s skewed to the left if the left side of the graph is much longer than the right side.
Whether they are skewed left or right, most distributions are unimodal, meaning they have a single peak. A distribution with two clear peaks is bimodal, and one with more than two is multimodal.
You should always discuss shape, center, spread, and possible outliers whenever comparing distributions of a quantitative variable.
Make sure to discuss the distributions of the samples.
Another common graph is a stemplot (also called a stem-and-leaf plot). Stemplots give a quick picture of the shape of a distribution while also including the actual numerical data.
To better view and compare data, there are a few different ways to arrange stemplots.
Splitting stems involves placing leaves 0-4 on one stem and 5-9 on another stem so the shape of the data is easier to see
Back-to-back stemplots help to compare data by sharing stems and having data on either side
Stemplots don’t work well for large data sets where each stem has to have a lot of leaves
Too few or too many stems will make it hard to see the distribution’s shape
Rounding data is sometimes necessary to avoid having too many digits (ex: a salary of 42,549 rounds to 43,000, which can be written with a stem of 4 and a leaf of 3)
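A basic stemplot can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the helper `stemplot` is hypothetical, it handles non-negative whole numbers below 100 (tens digit as stem, ones digit as leaf), and it does not split stems or round. The scores are made up.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot lines, using the tens digit as the stem
    and the ones digit as the leaf."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    lines = []
    # Include every stem from smallest to largest, even empty ones,
    # so that gaps in the distribution stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        leaf_text = "".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem} | {leaf_text}".rstrip())
    return lines

# Hypothetical quiz scores (made-up data).
scores = [12, 15, 21, 24, 24, 40, 47]
for line in stemplot(scores):
    print(line)
```

Keeping the empty stem for the 30s is a deliberate choice: dropping it would hide the gap in the data.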
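A graph of the distribution can sometimes be clearer if nearby values are grouped together. A histogram divides the range of the data into classes of equal width and shows the count (or percent) of observations in each class.
Don’t confuse histograms and bar graphs: a histogram displays a quantitative variable, while a bar graph displays a categorical variable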
Use percents instead of counts on the vertical axis when comparing distributions with different numbers of observations
Just because a graph looks nice doesn’t make it a meaningful display of data
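The grouping step behind a histogram can be sketched as follows. The helper `class_counts`, its parameters, and the ages are assumptions made up for illustration. Each class includes its left endpoint and excludes its right endpoint, which is one common convention.

```python
def class_counts(values, start, width, n_classes):
    """Count the values falling in each class
    [start + k*width, start + (k+1)*width) for k = 0 .. n_classes - 1."""
    counts = [0] * n_classes
    for v in values:
        k = int((v - start) // width)
        if 0 <= k < n_classes:
            counts[k] += 1
    return counts

# Hypothetical ages (made-up data), grouped into classes 10-19, 20-29, 30-39.
ages = [14, 15, 15, 16, 17, 22, 25, 31, 38, 39]
counts = class_counts(ages, start=10, width=10, n_classes=3)

# Percents make histograms comparable when the groups have different sizes.
percents = [100 * c / len(ages) for c in counts]

print(counts)
print(percents)
```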
The most common measure of center is the ordinary arithmetic average, or mean.
To find the mean of a set of observations, add their values and divide by the number of observations
The notation “x-bar” is commonly used to refer to the mean, but it only refers to the mean of a sample.
The mean’s weakness as a measure of center is that it’s sensitive to the influence of extreme observations. In other words, in a skewed data set, the outliers pull the mean towards the tail. Because of this, the mean is not a resistant measure of center.
The median is the midpoint of a distribution, the number such that about half the observations are smaller and about half are larger.
To find the median of a distribution:
Arrange all observations in order of size, from smallest to largest
If the number of observations n is odd, the median is the center observation in the ordered list
If the number of observations n is even, the median is the average of the two center observations in the ordered list
The median, unlike the mean, is a resistant measure of center. The outlier just counts as one observation above the center, no matter how far above the center it lies. The mean uses the actual value of each observation and will chase a single large observation upward.
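The contrast between the two measures of center can be sketched in Python. The functions follow the steps described above. The salary figures are made up for illustration, and the names `salaries` and `with_outlier` are assumptions.

```python
def mean(values):
    """Add the values and divide by the number of observations."""
    return sum(values) / len(values)

def median(values):
    """Order the observations, then take the center one (n odd)
    or the average of the two center ones (n even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical salaries in thousands of dollars (made-up data).
salaries = [30, 32, 35, 38, 40]
with_outlier = salaries + [400]

print(mean(salaries), median(salaries))
print(mean(with_outlier), median(with_outlier))
```

Adding the single value 400 moves the median only from 35 to 36.5, while the mean jumps from 35 to nearly 96. That is what it means for the median to be resistant and the mean not.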
A useful numerical description of a distribution requires both a measure of center and a measure of spread. The simplest measure of variability is the range.
To compute the range, subtract the smallest value from the largest value.
To calculate the quartiles:
Arrange the observations in increasing order and locate the median in the ordered list of observations
The first quartile Q1 is the median of the observations that are to the left of the median in the ordered list
The third quartile Q3 is the median of the observations that are to the right of the median in the ordered list.
The interquartile range (IQR) is defined as
IQR = Q3 - Q1
In addition to serving as a measure of spread, the interquartile range (IQR) is used in the 1.5 x IQR rule for identifying outliers
An observation is an outlier if it falls more than 1.5 x IQR above the third quartile or below the first quartile
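The quartile steps and the 1.5 x IQR rule can be sketched as below. The helpers `quartiles` and `outliers` and the data are hypothetical. The code follows the method described above, leaving the median itself out of both halves when n is odd. Statistical software often uses other quartile conventions and may give slightly different values.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the halves on either side of the median."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # observations to the left of the median
    upper = ordered[half + n % 2:]    # observations to the right of the median
    return median(lower), median(ordered), median(upper)

def outliers(values):
    """Return the observations flagged by the 1.5 x IQR rule."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

# Hypothetical travel times in minutes (made-up data).
times = [5, 7, 8, 10, 12, 13, 15, 40]
q1, med, q3 = quartiles(times)
print(q1, med, q3, q3 - q1)
print(outliers(times))
```

Here Q1 = 7.5 and Q3 = 14, so IQR = 6.5 and the upper cutoff is 14 + 1.5 x 6.5 = 23.75. The value 40 lies beyond it and is flagged.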
The five-number summary of a distribution consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation, written in order from smallest to largest
Minimum, Q1, median, Q3, maximum
The five-number summary leads to a new graph, the boxplot (also known as a box-and-whisker plot).
A central box is drawn from Q1 to Q3
A line in the box marks the median
Lines (called whiskers) extend from the box out to the smallest and largest observations that are not outliers
Outliers are marked with a special symbol such as an asterisk (*)
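The numbers needed to draw a boxplot by hand can be collected with a short sketch. The function `boxplot_parts` and the data are hypothetical. Quartiles are found as the medians of the two halves on either side of the median, and outliers are flagged with the 1.5 x IQR rule. Plotting software may compute quartiles differently.

```python
from statistics import median

def boxplot_parts(values):
    """Return the five-number summary, whisker ends, and outliers."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[: n // 2])
    q3 = median(ordered[n // 2 + n % 2:])
    iqr = q3 - q1
    # Observations that are NOT outliers under the 1.5 x IQR rule.
    inside = [v for v in ordered if q1 - 1.5 * iqr <= v <= q3 + 1.5 * iqr]
    return {
        "five_number": (ordered[0], q1, median(ordered), q3, ordered[-1]),
        "whiskers": (inside[0], inside[-1]),
        "outliers": [v for v in ordered if v not in inside],
    }

# Hypothetical test scores (made-up data).
parts = boxplot_parts([2, 18, 20, 21, 23, 25, 27, 30, 55])
print(parts)
```

In this example the whiskers stop at 18 and 30, not at the minimum 2 and maximum 55, because those two values are outliers and would be marked separately.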
The most common numerical description of a distribution is the combination of the mean to measure center and the standard deviation to measure spread. The standard deviation and the variance measure spread by looking at how far the observations are from their mean.
The standard deviation Sx measures the typical distance of the values in a distribution from the mean. It’s calculated by finding an average of the squared deviations, then taking the square root. This average squared deviation, Sx^2, is called the variance
To find the standard deviation of n observations:
Find the distance of each observation from the mean and square each of these distances
Average the squared distances by dividing their sum by n - 1
The standard deviation Sx is the square root of this average squared distance
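The three steps can be sketched directly in Python. The function name `sample_std` and the data are assumptions for illustration. The division is by n - 1, as in the steps above.

```python
from math import sqrt

def sample_std(values):
    """Standard deviation of n observations, dividing by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    # Step 1: squared distance of each observation from the mean.
    squared_deviations = [(x - x_bar) ** 2 for x in values]
    # Step 2: "average" the squared distances by dividing their sum by n - 1.
    variance = sum(squared_deviations) / (n - 1)
    # Step 3: the standard deviation is the square root of the variance.
    return sqrt(variance)

# Hypothetical data (made up): mean 5, squared deviations 4, 4, 0, 4, 4.
data = [3, 3, 5, 7, 7]
print(sample_std(data))
```

For this data the squared deviations sum to 16, the variance is 16 / 4 = 4, and the standard deviation is 2.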
When choosing measures of center and spread, it’s important to pay attention to outliers and the skewness of the distribution. Because the mean and standard deviation are sensitive to extreme observations, they can be misleading when a distribution is strongly skewed or has outliers. In these cases, the median and IQR, which are both resistant to extreme values, provide a better summary.
How to organize a statistics problem: A four-step process
State: What’s the question you’re trying to answer?
Plan: How will you go about answering the question? What statistical technique does this problem call for?
Do: Make graphs and carry out needed calculations
Conclude: Give your conclusion in the setting of the real-world problem