Exploring Data [The Practice of Statistics- Chapter 1]

Statistics is the science of data.

To hear what data is saying, we need to help it speak by organizing, displaying, summarizing, and asking questions. That’s data analysis.

Any set of data contains information about some group of individuals. The characteristics we measure on each individual are called variables.

Individuals are the objects described by a set of data. Individuals may be people, animals, or things.

A variable is any characteristic of an individual. A variable can take different values for different individuals

Ex: In a database of all students attending a high school, the students are the individuals. The data contains values of variables such as age, gender, GPA, homeroom, and grade level.

Some variables, like gender and grade level, assign labels to individuals that place them in categories. Others, like age and GPA, take numerical values that we can do math with.

A categorical variable places an individual into one of several groups or categories

A quantitative variable takes numerical values for which it makes sense to find an average

**IMPORTANT:** **Not every variable that takes number values is quantitative (ex: zip codes)**

Most data tables follow the format of each row as an individual, and each column as a variable. Categorical variables sometimes have similar counts in each category and sometimes don’t. Quantitative variables may take values that are very close together or values that are spread out. This pattern of variation of a variable is its distribution.

The distribution of a variable tells us what values the variable takes and how often it takes these values.

Inference is the idea of drawing conclusions that go beyond the data at hand.

Our ability to do inference is determined by how the data are produced.

The logic of inference rests on asking, “What are the chances?”

The values of a categorical variable are labels for the categories, such as “male” and “female”. The distribution of a categorical variable lists the categories and gives either the *count* or the *percent* of individuals who fall within each category.

Frequency tables display the counts (frequencies) of the variable in each category

Relative frequency tables show the percents (relative frequencies) of the variable in each category
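As a sketch, both kinds of table can be built from raw data with Python’s standard library (the subject data below is made up for illustration):

```python
from collections import Counter

# Hypothetical data: favorite subject reported by 10 students
subjects = ["math", "english", "math", "science", "math",
            "english", "science", "math", "english", "math"]

counts = Counter(subjects)                        # frequency table (counts)
n = len(subjects)
rel_freq = {k: v / n for k, v in counts.items()}  # relative frequency table (proportions)

print(counts["math"])    # 5
print(rel_freq["math"])  # 0.5
```

Multiply each relative frequency by 100 to express it as a percent.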

Columns of numbers take time to read. You can use a pie chart or a bar graph to display the distribution more clearly.

Pie charts show the distribution of a categorical variable as a “pie” whose slices are sized by the counts or percents for the categories

Bar graphs represent each category as a bar. The bar heights show the category counts or percents

When you draw a bar graph, make sure to make the bars *equally wide*. Equal widths ensure that the bars are compared by height alone, so your eyes aren’t deceived by differences in area.

Another issue to pay attention to is the scale. Starting the y-axis at a number other than 0 can exaggerate (or understate) the differences between categories.

To best grasp the information in a two-way table, start by looking at the distribution of each variable separately as a single variable, or the marginal distribution.

The marginal distribution of one of the categorical variables in a two-way table of counts is the distribution of values of that variable among all individuals described by the table

Percents are often more informative than counts, especially when we are comparing groups of different sizes.

Marginal distributions tell us nothing about the relationship between two variables. To analyze the relationship between the variables, we must use a conditional distribution.

A conditional distribution of a variable describes the values of that variable among individuals who have a specific value of another variable. There is a separate conditional distribution for each value of the other variable

There are *two sets* of conditional distributions for any two-way table: one for the column variable and one for the row variable.
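A minimal Python sketch of marginal and conditional distributions, using a hypothetical two-way table of counts:

```python
# Hypothetical two-way table: rows = gender, columns = preferred snack
table = {
    "male":   {"chips": 20, "fruit": 10},
    "female": {"chips": 15, "fruit": 25},
}

grand_total = sum(sum(row.values()) for row in table.values())  # 70 individuals

# Marginal distribution of the row variable (gender), among all individuals
marginal_gender = {g: sum(row.values()) / grand_total for g, row in table.items()}

# Conditional distribution of snack, given gender == "female"
female_total = sum(table["female"].values())
cond_snack_given_female = {s: c / female_total for s, c in table["female"].items()}

print(marginal_gender["male"])           # 30/70, about 0.429
print(cond_snack_given_female["fruit"])  # 25/40 = 0.625
```

Repeating the conditional calculation for each row (and again for each column) produces the two sets of conditional distributions described above.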

We could also use a segmented bar graph or a side-by-side bar graph to compare conditional distributions. Both graphs can be useful and can provide evidence of association between variables.

We say that there’s an association between two variables if knowing the value of one variable helps predict the value of the other

**IMPORTANT: Even a strong association between two categorical variables can be influenced by other variables lurking in the background.**

One of the simplest graphs to construct and interpret is a dotplot where each data value is shown as a dot above its location on a number line.

In any graph, look for the **overall pattern** and for striking **departures** from that pattern.

You can describe the overall pattern of a distribution by its **shape, center,** and **spread**. An important kind of departure is an **outlier**, an individual value that falls outside the overall pattern. **SOCS** is the acronym to remember how to describe the pattern.

When describing the shape of a distribution, focus on the main features such as major peaks, clusters and gaps, potential outliers, and rough symmetry and clear skewness.

A distribution is roughly symmetric if the right and left sides of the graph are approximately mirror images of each other

A distribution is skewed to the right if the right side of the graph is much longer than the left side. It’s skewed to the left if the left side of the graph is much longer than the right side.

Whether they are skewed left or right, most graphs are unimodal, meaning they have a single peak. A graph with two clear peaks is bimodal, and with more than that, it’s multimodal.

You should always discuss shape, center, spread, and possible outliers whenever comparing distributions of a quantitative variable.

Make sure to explicitly compare the distributions of the *samples*, rather than just describing each one separately.

Another common graph is a stemplot (also called a stem-and-leaf plot). Stemplots give a quick picture of the shape of a distribution while also including the actual numerical data.

To better view and compare data, there are a few different ways to arrange stemplots.

Splitting stems involves placing leaves 0-4 on one stem and 5-9 on another stem to view the spread of the data more easily

Back-to-back stemplots help to compare data by sharing stems and having data on either side

Stemplots don’t work well for large data sets where each stem has to have a lot of leaves

Too few or too many stems will make it hard to see the distribution’s shape

Rounding data is sometimes necessary to avoid having too many digits (Ex: a salary of $42,549 rounds to $43,000, written with a stem of 4 and a leaf of 3)
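As an illustration (not from the book), a bare-bones stemplot for two-digit values can be built in a few lines of Python:

```python
from collections import defaultdict

def stemplot(values):
    """Split each value into a stem (tens digit) and a leaf (ones digit)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(sorted(stems.items()))

data = [42, 45, 47, 51, 53, 53, 60]
plot = stemplot(data)
for stem, leaves in plot.items():
    print(f"{stem} | {''.join(map(str, leaves))}")
# 4 | 257
# 5 | 133
# 6 | 0
```

Rotate the display 90° counterclockwise and the shape of the distribution appears, just as in a histogram.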

A graph of the distribution can sometimes be clearer if nearby values are grouped together. A histogram is a graph that shows the counts (or percents) of values falling into equal-width classes.

Don’t confuse histograms and bar graphs

Use percents instead of counts on the vertical axis when comparing distributions with different numbers of observations

Just because a graph looks nice doesn’t make it a meaningful display of data

The most common measure of center is the ordinary arithmetic average, or mean.

To find the mean of a set of observations, add their values and divide by the number of observations

The notation “x-bar” is commonly used to refer to the mean, but it only refers to the mean of a *sample*.

The mean’s weakness as a measure of center is that it’s sensitive to the influence of extreme observations. In other words, in a skewed data set, the outliers pull the mean towards the tail. Because of this, the mean is *not* a resistant measure of center.

The median is the midpoint of a distribution, the number such that about half the observations are smaller and about half are larger.

To find the median of a distribution:

Arrange all observations in order of size, from smallest to largest

If the number of observations *n* is odd, the median is the center observation in the ordered list.

If the number of observations *n* is even, the median is the average of the two center observations in the ordered list.

The median, unlike the mean, is a resistant measure of center. An outlier counts as just one observation above the center, no matter how far above the center it lies. The mean uses the actual value of each observation and will chase a single large observation upward.
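A quick Python check of this resistance, using a made-up data set:

```python
from statistics import mean, median

data = [2, 3, 3, 4, 5]
with_outlier = data + [50]  # add a single extreme observation

print(mean(data), median(data))                  # 3.4  3
print(mean(with_outlier), median(with_outlier))  # about 11.17  3.5
```

One extreme value more than triples the mean but barely moves the median.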

A useful numerical description of a distribution requires both a measure of center and a measure of spread. The simplest measure of variability is the range.

To compute the range, subtract the smallest value from the largest value.

To calculate the quartiles:

Arrange the observations in increasing order and locate the median in the ordered list of observations

The first quartile Q1 is the median of the observations that are to the left of the median in the ordered list

The third quartile Q3 is the median of the observations that are to the right of the median in the ordered list.

The interquartile range (IQR) is defined as

IQR = Q3 - Q1

In addition to serving as a measure of spread, the interquartile range (IQR) is used in the 1.5 x IQR rule for identifying outliers

An observation is an outlier if it falls more than 1.5 x IQR above the third quartile or below the first quartile
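A Python sketch of the quartile steps and the 1.5 × IQR rule, with made-up data (note that statistical software often computes quartiles by slightly different conventions):

```python
def quartiles(values):
    """Q1 and Q3 per the steps above: medians of the halves below/above the overall median."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # observations left of the median
    upper = ordered[half + n % 2:]    # observations right of the median (skip median if n odd)

    def med(xs):
        m = len(xs) // 2
        return xs[m] if len(xs) % 2 else (xs[m - 1] + xs[m]) / 2

    return med(lower), med(upper)

data = [5, 7, 10, 14, 18, 19, 25, 29, 31, 33, 80]
q1, q3 = quartiles(data)     # 10, 31
iqr = q3 - q1                # 21
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print(q1, q3, iqr, outliers)  # 10 31 21 [80]
```

Here the fences are 10 − 31.5 = −21.5 and 31 + 31.5 = 62.5, so only 80 is flagged as an outlier.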

The five-number summary of a distribution consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation, written in order from smallest to largest:

Minimum, Q1, median, Q3, maximum

The five-number summary leads to a new graph, the boxplot (also known as a box-and-whisker plot).

A central box is drawn from Q1 to Q3

A line in the box marks the median

Lines (called whiskers) extend from the box out to the smallest and largest observations that are not outliers

Outliers are marked with a special symbol such as an asterisk (*)

The most common numerical description of a distribution is the combination of the mean to measure center and the standard deviation to measure spread. The standard deviation and the variance measure spread by looking at how far the observations are from their mean.

The standard deviation Sx measures the typical distance of the values in a distribution from the mean. It’s calculated by finding an average of the squared deviations, then taking the square root. This average squared deviation, Sx², is called the variance.

To find the standard deviation of *n* observations:

Find the distance of each observation from the mean and square each of these distances

Average the squared distances by dividing their sum by n - 1

The standard deviation Sx is the square root of this average squared distance
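These steps can be sketched in Python (this mirrors the n − 1 “sample” definition above):

```python
from math import sqrt

def sample_std(values):
    """Sample standard deviation per the steps above (divide by n - 1)."""
    n = len(values)
    xbar = sum(values) / n
    squared_deviations = [(x - xbar) ** 2 for x in values]
    variance = sum(squared_deviations) / (n - 1)  # the sample variance, Sx^2
    return sqrt(variance)

data = [1, 3, 4, 4, 8]
print(sample_std(data))  # sqrt(26/4), about 2.55
```

Python’s built-in `statistics.stdev` uses this same n − 1 divisor; `statistics.pstdev` divides by n instead.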

When choosing measures of spread, it’s important to pay attention to outliers and the skewness of the distribution. Because the mean and standard deviation are sensitive to extreme observations, they can be misleading when a distribution is strongly skewed or has outliers. In these cases, the median and IQR, which are both resistant to extreme values, provide a better summary.

How to organize a statistics problem: A four-step process

**State:** What’s the question you’re trying to answer?

**Plan:** How will you go about answering the question? What statistical technique does this problem call for?

**Do:** Make graphs and carry out needed calculations.

**Conclude:** Give your conclusion in the setting of the real-world problem.
