Exploring Data [The Practice of Statistics- Chapter 1]

^^Introduction- Data Analysis: Making Sense of Data^^

%%Statistics%% is the science of data.

To hear what data is saying, we need to help it speak by organizing, displaying, summarizing, and asking questions. That’s %%data analysis%%.

Individuals and Variables

Any set of data contains information about some group of %%individuals%%. The characteristics we measure on each individual are called %%variables%%.

%%Individuals%% are the objects described by a set of data. Individuals may be people, animals, or things.
A %%variable%% is any characteristic of an individual. A variable can take different values for different individuals

Ex: In a database of all students attending a high school, the students are the individuals. The data contains values of variables such as age, gender, GPA, homeroom, and grade level.

Some variables, like gender and grade level, assign labels to individuals that place them in categories. Others, like age and GPA, take numerical values that we can do math with.

A %%categorical variable%% places an individual into one of several groups or categories
A %%quantitative variable%% takes numerical values for which it makes sense to find an average

IMPORTANT: Not every variable that takes number values is quantitative (ex: zip codes)

In most data tables, each row is an individual and each column is a variable. Categorical variables sometimes have similar counts in each category and sometimes don’t. Quantitative variables may take values that are very close together or values that are spread out. This pattern of variation of a variable is its %%distribution%%.

The %%distribution%% of a variable tells us what values the variable takes and how often it takes these values.

From Data Analysis to Inference

%%Inference%% is the idea of drawing conclusions that go beyond the data at hand.

Our ability to do inference is determined by how the data are produced.

The logic of inference rests on asking, “What are the chances?”

^^1.1- Analyzing Categorical Data^^

The values of a categorical variable are labels for the categories, such as “male” and “female”. The distribution of a categorical variable lists the categories and gives either the count or the percent of individuals who fall within each category.

%%Frequency tables%% display the counts (frequencies) of the variable in each category
%%Relative frequency tables%% show the percents (relative frequencies) of the variable in each category
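
As a rough illustration of the two kinds of tables, the sketch below tallies a small categorical variable in Python. The `grade_levels` list is made-up data, not taken from the chapter.

```python
from collections import Counter

# Hypothetical data: the grade level of each of 10 students.
grade_levels = ["9", "10", "10", "11", "12", "9", "10", "11", "10", "12"]

# Frequency table: the count of individuals in each category.
frequency = Counter(grade_levels)

# Relative frequency table: each count as a percent of all individuals.
total = len(grade_levels)
relative_frequency = {cat: 100 * count / total for cat, count in frequency.items()}

for cat in sorted(frequency, key=int):
    print(f"Grade {cat}: count = {frequency[cat]}, percent = {relative_frequency[cat]:.1f}%")
```

The percents in a relative frequency table should add up to 100%, apart from rounding.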

Bar Graphs and Pie Charts

Columns of numbers take time to read. You can use a %%pie chart%% or a %%bar graph%% to display the distribution of data more clearly.

%%Pie charts%% show the distribution of a categorical variable as a “pie” whose slices are sized by the counts or percents for the categories
%%Bar graphs%% represent each category as a bar. The bar heights show the category counts or percents

Graphs: Good and Bad

When you draw a bar graph, make the bars equally wide. Our eyes respond to the area of a bar as well as its height, so equal widths ensure that the bars are compared by height alone.

Another issue to pay attention to is the scale. Starting the y-axis at a number other than 0 can make differences between categories look larger than they really are.

Two-Way Tables and Marginal Distributions

To best grasp the information in a two-way table, start by looking at the distribution of each variable separately as a single variable, or the %%marginal distribution%%.

The %%marginal distribution%% of one of the categorical variables in a two-way table of counts is the distribution of values of that variable among all individuals described by the table

Percents are often more informative than counts, especially when we are comparing groups of different sizes.
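
A minimal sketch of how the marginal distributions could be computed from a two-way table of counts. The table `two_way` and its category names are hypothetical, chosen only for illustration.

```python
# Hypothetical two-way table of counts: rows are grade levels,
# columns are answers to "Do you play a sport?".
two_way = {
    "Freshman": {"Yes": 36, "No": 24},
    "Senior": {"Yes": 12, "No": 28},
}

grand_total = sum(sum(cells.values()) for cells in two_way.values())

# Marginal distribution of the row variable: each row total as a percent
# of all individuals in the table.
row_marginal = {
    row: 100 * sum(cells.values()) / grand_total for row, cells in two_way.items()
}

# Marginal distribution of the column variable: each column total as a
# percent of all individuals in the table.
columns = ["Yes", "No"]
col_marginal = {
    col: 100 * sum(two_way[row][col] for row in two_way) / grand_total
    for col in columns
}

print(row_marginal)
print(col_marginal)
```

Each marginal distribution ignores the other variable entirely, which is why it uses only the totals in the margins of the table.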

Relationships between Categorical Variables: Conditional Distributions

Marginal distributions tell us nothing about the relationship between two variables. To analyze the relationship between the variables, we must use a %%conditional distribution%%.

A %%conditional distribution%% of a variable describes the values of that variable among individuals who have a specific value of another variable. There is a separate conditional distribution for each value of the other variable

There are two sets of conditional distributions for any two-way table: one for the column variable and one for the row variable.
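
The sketch below shows one way to compute the conditional distribution of the column variable for each value of the row variable. The table and the helper name `conditional_distribution` are assumptions made for this example.

```python
# Hypothetical two-way table of counts: rows are grade levels,
# columns are answers to "Do you play a sport?".
two_way = {
    "Freshman": {"Yes": 36, "No": 24},
    "Senior": {"Yes": 12, "No": 28},
}

def conditional_distribution(table, row):
    """Percent of each column category among individuals in the given row."""
    cells = table[row]
    row_total = sum(cells.values())
    return {col: 100 * count / row_total for col, count in cells.items()}

# One conditional distribution for each value of the row variable.
for row in two_way:
    print(row, conditional_distribution(two_way, row))
```

In this made-up table, 60% of freshmen but only 30% of seniors answer "Yes", so the conditional distributions differ from one group to the other.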

Putting It All Together: Relationships Between Categorical Variables

We could also use a segmented bar graph or a side-by-side bar graph to compare conditional distributions. Both graphs can be useful and can provide evidence of an association between variables.

We say that there’s an %%association%% between two variables if knowing the value of one variable helps predict the value of the other

IMPORTANT: Even a strong association between two categorical variables can be influenced by other variables lurking in the background.

^^1.2- Displaying Quantitative Data with Graphs^^

Dotplots

One of the simplest graphs to construct and interpret is a %%dotplot%% where each data value is shown as a dot above its location on a number line.

How to Examine the Distribution of a Quantitative Variable

In any graph, look for the overall pattern and for striking departures from that pattern.

You can describe the overall pattern of a distribution by its shape, center, and spread.
An important kind of departure is an outlier, an individual value that falls outside the overall pattern.
SOCS (Shape, Outliers, Center, Spread) is an acronym for remembering how to describe the pattern

Describing Shape

When describing the shape of a distribution, focus on the main features such as major peaks, clusters and gaps, potential outliers, and rough %%symmetry%% and clear %%skewness%%.

A distribution is roughly %%symmetric%% if the right and left sides of the graph are approximately mirror images of each other
A distribution is %%skewed to the right%% if the right side of the graph is much longer than the left side. It is %%skewed to the left%% if the left side of the graph is much longer than the right side.

Whether they are skewed left or right, most graphs are %%unimodal%%, meaning they have a single peak. A graph with two clear peaks is %%bimodal%%, and one with more than two is %%multimodal%%.

Comparing Distributions

You should always discuss shape, center, spread, and possible outliers whenever comparing distributions of a quantitative variable.

Make sure to discuss the distributions of the samples.

Stemplots

Another common graph is a %%stemplot%% (also called a stem-and-leaf plot). Stemplots give a quick picture of the shape of a distribution while also including the actual numerical data.

To better view and compare data, there are a few different ways to arrange stemplots.

%%Splitting stems%% involves placing leaves 0-4 on one stem and 5-9 on another stem to make the shape of the data easier to see
%%Back-to-back stemplots%% help to compare data by sharing stems and having data on either side

Tips for making stemplots

Stemplots don’t work well for large data sets where each stem has to have a lot of leaves
Too few or too many stems will make it hard to see the distribution’s shape
Rounding data is sometimes necessary to avoid having too many digits (Ex: a salary of 42,549 rounds to 43,000 and can be written with a stem of 4 and a leaf of 3)
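
One possible way to build a basic stemplot in code, assuming non-negative whole numbers with the last digit as the leaf. The function name `stemplot` and the test scores are hypothetical.

```python
from collections import defaultdict

def stemplot(values):
    """Return the lines of a basic stemplot for non-negative integers.

    The stem is every digit except the last; the leaf is the last digit.
    Assumes the list of values is not empty.
    """
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    lines = []
    # Include every stem between the smallest and largest, even empty ones,
    # so that gaps in the distribution remain visible.
    for stem in range(min(leaves), max(leaves) + 1):
        lines.append(f"{stem} | {''.join(str(leaf) for leaf in leaves[stem])}")
    return lines

# Hypothetical test scores.
scores = [72, 85, 91, 68, 77, 85, 94, 79, 53, 88]
for line in stemplot(scores):
    print(line)
print("Key: 7 | 2 means a score of 72")
```

A stemplot should always come with a key, since the stems and leaves alone do not say what units or place values they represent.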

Histograms

A graph of the distribution can sometimes be clearer if nearby values are grouped together. A %%histogram%% groups nearby values into classes of equal width and shows the count (or percent) of observations in each class as a bar.
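
To make the grouping idea concrete, here is a small sketch that counts observations in classes of equal width and prints a text version of a histogram. The commute times and the class width of 10 are assumptions chosen for illustration.

```python
# Hypothetical data: commute times in minutes.
times = [5, 12, 18, 22, 25, 27, 31, 34, 38, 45, 52, 58]

class_width = 10  # assumed class width; a different width changes the picture

# Count how many observations fall in each class [lower, lower + width).
counts = {}
for t in times:
    lower = (t // class_width) * class_width
    counts[lower] = counts.get(lower, 0) + 1

# Print one row per class, including any class with no observations.
for lower in range(0, max(times) + 1, class_width):
    n = counts.get(lower, 0)
    print(f"{lower:>2} to <{lower + class_width:<2} | {'*' * n} ({n})")
```

Each observation lands in exactly one class, so the class counts add up to the number of observations.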

Using Histograms Wisely

Don’t confuse histograms and bar graphs
Use percents instead of counts on the vertical axis when comparing distributions with different numbers of observations
Just because a graph looks nice doesn’t make it a meaningful display of data

^^1.3- Describing Quantitative Data with Numbers^^

Measuring Center: The Mean

The most common measure of center is the ordinary arithmetic average, or %%mean%%.

To find the %%mean%% of a set of observations, add their values and divide by the number of observations

The notation “x-bar” is commonly used to refer to the mean, but it only refers to the mean of a sample.

The mean’s weakness as a measure of center is that it’s sensitive to the influence of extreme observations. In other words, in a skewed data set, the outliers pull the mean towards the tail. Because of this, the mean is not a %%resistant measure%% of center.
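
A short sketch of the mean calculation and of how one extreme observation pulls it. The salary figures are invented for this example.

```python
def mean(observations):
    """Add the values and divide by the number of observations."""
    return sum(observations) / len(observations)

# Hypothetical salaries in thousands of dollars.
salaries = [40, 42, 45, 47, 51]
print(mean(salaries))  # 45.0

# Add one extreme observation and the mean is pulled toward it.
salaries_with_outlier = salaries + [300]
print(mean(salaries_with_outlier))  # 87.5
```

With the extreme value included, the mean of 87.5 is larger than five of the six salaries, so it no longer describes a typical value.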

Measuring Center: The Median

The %%median%% is the midpoint of a distribution, the number such that about half the observations are smaller and about half are larger.

To find the median of a distribution:

Arrange all observations in order of size, from smallest to largest
If the number of observations n is odd, the median is the center observation in the ordered list
If the number of observations n is even, the median is the average of the two center observations in the ordered list
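
The steps above can be sketched in Python as follows. The function name `median` and the example values are illustrative only.

```python
def median(observations):
    """Median found by the ordered-list steps: sort, then take the middle."""
    ordered = sorted(observations)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the center observation.
        return ordered[mid]
    # Even n: the average of the two center observations.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([40, 42, 45, 47, 51]))       # odd n: 45
print(median([40, 42, 45, 47, 51, 300]))  # even n: 46.0
```

Adding the extreme value 300 moves the median only from 45 to 46.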

Comparing the Mean and the Median

The median, unlike the mean, is a %%resistant measure%% of center. An outlier counts as just one observation above the center, no matter how far above the center it lies. The mean uses the actual value of each observation and will chase a single large observation upward.

Measuring Spread: Range and Interquartile Range (IQR)

A useful numerical description of a distribution requires both a measure of center and a measure of spread. The simplest measure of variability is the %%range%%.

To compute the range, subtract the smallest value from the largest value.

How to Calculate the Quartiles Q1 and Q3 and the Interquartile Range (IQR)

To calculate the quartiles:

Arrange the observations in increasing order and locate the median in the ordered list of observations
The %%first quartile Q1%% is the median of the observations that are to the left of the median in the ordered list
The %%third quartile Q3%% is the median of the observations that are to the right of the median in the ordered list.

The %%interquartile range (IQR)%% is defined as

IQR = Q3 - Q1
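
A sketch of the quartile procedure described above, in which the median itself is left out of both halves when n is odd. Note that statistical software often uses other quartile conventions and can give slightly different values. The function names and data are hypothetical, and the data set is assumed to have at least two observations.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(observations):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(observations)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]             # observations to the left of the median
    upper = ordered[half + (n % 2):]   # observations to the right of the median
    return median(lower), median(upper)

data = [5, 7, 10, 14, 18, 19, 25, 29, 31, 33]
q1, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q3, iqr)  # 10 29 19
```

The IQR of 19 is the range of the middle half of the data.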

Identifying outliers

In addition to serving as a measure of spread, the interquartile range (IQR) is used in the %%1.5 x IQR rule%% for identifying outliers

An observation is an outlier if it falls more than 1.5 x IQR above the third quartile or below the first quartile
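
As an illustration of the rule, the sketch below computes the two cutoffs (often called fences, a name not used in these notes) from given quartiles and flags any observation beyond them. The data and quartile values are made up for the example.

```python
def outlier_fences(q1, q3):
    """Cutoffs for the 1.5 x IQR rule, given the two quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical data with Q1 = 10 and Q3 = 29 (medians of the lower and upper halves).
data = [5, 7, 10, 14, 18, 19, 25, 29, 31, 80]
low_fence, high_fence = outlier_fences(q1=10, q3=29)

# An observation is an outlier if it lies below the low fence or above the high fence.
outliers = [x for x in data if x < low_fence or x > high_fence]
print(low_fence, high_fence, outliers)  # -18.5 57.5 [80]
```

Here 1.5 x IQR = 28.5, so only the value 80 lies more than 28.5 above Q3.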

The Five-Number Summary and Boxplots

The %%five-number summary%% of a distribution consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation, written in order from smallest to largest

Minimum, Q1, median, Q3, maximum

The five-number summary leads to a new graph, the %%boxplot%% (also known as a box-and-whisker plot).

How to Make a Boxplot

A central box is drawn from Q1 to Q3
A line in the box marks the median
Lines (called whiskers) extend from the box out to the smallest and largest observations that are not outliers
Outliers are marked with a special symbol such as an asterisk (*)
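
The sketch below gathers the numbers needed to draw a boxplot by hand: the five-number summary, the ends of the whiskers, and any outliers. It assumes the quartile method from these notes (medians of the lower and upper halves). The function name `boxplot_summary` and the data are hypothetical.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2

def boxplot_summary(observations):
    ordered = sorted(observations)
    n = len(ordered)
    q1 = median(ordered[: n // 2])
    med = median(ordered)
    q3 = median(ordered[n // 2 + n % 2 :])
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers stop at the most extreme observations that are not outliers.
    inside = [x for x in ordered if low <= x <= high]
    return {
        "five_number_summary": (ordered[0], q1, med, q3, ordered[-1]),
        "whiskers": (inside[0], inside[-1]),
        "outliers": [x for x in ordered if x < low or x > high],
    }

data = [5, 7, 10, 14, 18, 19, 25, 29, 31, 80]
summary = boxplot_summary(data)
print(summary)
```

For this data the upper whisker ends at 31 rather than at the maximum of 80, because 80 is flagged as an outlier and would be drawn as a separate symbol.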

Measuring Spread: The Standard Deviation

The most common numerical description of a distribution is the combination of the mean to measure center and the %%standard deviation%% to measure spread. The standard deviation and the %%variance%% measure spread by looking at how far the observations are from their mean.

The standard deviation s_x measures the typical distance of the values in a distribution from the mean. It’s calculated by finding an average of the squared deviations, then taking the square root. This average squared deviation, s_x^2, is called the variance

How to Find the Standard Deviation

To find the standard deviation of n observations:

Find the distance of each observation from the mean and square each of these distances
Average the squared distances by dividing their sum by n - 1
The standard deviation s_x is the square root of this average squared distance
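
The three steps above can be sketched as follows. The function name and the data are made up for illustration.

```python
import math

def sample_standard_deviation(observations):
    """Standard deviation using the n - 1 divisor described in the steps."""
    n = len(observations)
    x_bar = sum(observations) / n
    # Step 1: squared distance of each observation from the mean.
    squared_distances = [(x - x_bar) ** 2 for x in observations]
    # Step 2: "average" the squared distances by dividing by n - 1 (the variance).
    variance = sum(squared_distances) / (n - 1)
    # Step 3: the standard deviation is the square root of the variance.
    return math.sqrt(variance)

data = [2, 2, 5, 8, 8]
print(sample_standard_deviation(data))  # 3.0
```

For this data the mean is 5, the squared distances sum to 36, the variance is 36 / 4 = 9, and the standard deviation is 3. Python's standard-library `statistics.stdev` also divides by n - 1.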

Choosing Measures of Center and Spread

When choosing measures of center and spread, it’s important to pay attention to outliers and the skewness of the distribution. Because the mean and standard deviation are sensitive to extreme observations, they can be misleading when a distribution is strongly skewed or has outliers. In these cases, the median and IQR, which are both resistant to extreme values, provide a better summary.

Organizing a statistics problem

How to organize a statistics problem: A four-step process

State: What’s the question you’re trying to answer?
Plan: How will you go about answering the question? What statistical technique does this problem call for?
Do: Make graphs and carry out needed calculations
Conclude: Give your conclusion in the setting of the real-world problem
