Exploring Data [The Practice of Statistics- Chapter 1]

Statistics is the science of data.

To hear what data is saying, we need to help it speak by organizing, displaying, summarizing, and asking questions. That’s data analysis.

Any set of data contains information about some group of individuals. The characteristics we measure on each individual are called variables.

Individuals are the objects described by a set of data. Individuals may be people, animals, or things.

A variable is any characteristic of an individual. A variable can take different values for different individuals

Ex: In a database of all students attending a high school, the students are the individuals. The data contains values of variables such as age, gender, GPA, homeroom, and grade level.

Some variables, like gender and grade level, assign labels to individuals that place them in categories. Others, like age and GPA, take numerical values that we can do math with.

A categorical variable places an individual into one of several groups or categories

A quantitative variable takes numerical values for which it makes sense to find an average

**IMPORTANT:** **Not every variable that takes number values is quantitative (ex: zip codes)**

Most data tables follow the format of each row as an individual, and each column as a variable. Categorical variables sometimes have similar counts in each category and sometimes don’t. Quantitative variables may take values that are very close together or values that are spread out. This pattern of variation of a variable is its distribution.

The distribution of a variable tells us what values the variable takes and how often it takes these values.

Inference is the idea of drawing conclusions that go beyond the data at hand.

Our ability to do inference is determined by how the data are produced.

The logic of inference rests on asking, “What are the chances?”

The values of a categorical variable are labels for the categories, such as “male” and “female”. The distribution of a categorical variable lists the categories and gives either the *count* or the *percent* of individuals who fall within each category.

Frequency tables display the counts (frequencies) of the variable in each category

Relative frequency tables show the percents (relative frequencies) of the variable in each category
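As a sketch, both kinds of table can be built from raw data with Python’s standard library (the subject data below is made up for illustration):

```python
from collections import Counter

# Hypothetical data: favorite subject reported by 10 students
subjects = ["math", "english", "math", "science", "math",
            "english", "science", "math", "english", "math"]

counts = Counter(subjects)                        # frequency table (counts)
n = len(subjects)
rel_freq = {k: v / n for k, v in counts.items()}  # relative frequency table (proportions)

print(counts["math"])    # 5
print(rel_freq["math"])  # 0.5
```

Multiply each relative frequency by 100 to express it as a percent.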

Columns of numbers take time to read. You can use a pie chart or a bar graph to display the distribution more clearly.

Pie charts show the distribution of a categorical variable as a “pie” whose slices are sized by the counts or percents for the categories

Bar graphs represent each category as a bar. The bar heights show the category counts or percents

When you draw a bar graph, make sure to make the bars *equally wide*. Equal widths ensure that the bars are compared by height alone, so your eyes aren’t deceived by differences in area.

Another issue to pay attention to is the scale. Starting the y-axis at a number other than 0 can exaggerate (or understate) the differences between categories.

To best grasp the information in a two-way table, start by looking at the distribution of each variable separately as a single variable, or the marginal distribution.

The marginal distribution of one of the categorical variables in a two-way table of counts is the distribution of values of that variable among all individuals described by the table

Percents are often more informative than counts, especially when we are comparing groups of different sizes.

Marginal distributions tell us nothing about the relationship between two variables. To analyze the relationship between the variables, we must use a conditional distribution.

A conditional distribution of a variable describes the values of that variable among individuals who have a specific value of another variable. There is a separate conditional distribution for each value of the other variable

There are *two sets* of conditional distributions for any two-way table: one for the column variable and one for the row variable.
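A minimal Python sketch of marginal and conditional distributions, using a hypothetical two-way table of counts:

```python
# Hypothetical two-way table: rows = gender, columns = preferred snack
table = {
    "male":   {"chips": 20, "fruit": 10},
    "female": {"chips": 15, "fruit": 25},
}

grand_total = sum(sum(row.values()) for row in table.values())  # 70 individuals

# Marginal distribution of the row variable (gender), among all individuals
marginal_gender = {g: sum(row.values()) / grand_total for g, row in table.items()}

# Conditional distribution of snack, given gender == "female"
female_total = sum(table["female"].values())
cond_snack_given_female = {s: c / female_total for s, c in table["female"].items()}

print(marginal_gender["male"])           # 30/70, about 0.429
print(cond_snack_given_female["fruit"])  # 25/40 = 0.625
```

Repeating the conditional calculation for each row (and again for each column) produces the two sets of conditional distributions described above.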

We could also use a segmented bar graph or a side-by-side bar graph to compare conditional distributions. Both graphs can be useful and can provide evidence of association between variables.

We say that there’s an association between two variables if knowing the value of one variable helps predict the value of the other

**IMPORTANT: Even a strong association between two categorical variables can be influenced by other variables lurking in the background.**

One of the simplest graphs to construct and interpret is a dotplot where each data value is shown as a dot above its location on a number line.

In any graph, look for the **overall pattern** and for striking **departures** from that pattern.

You can describe the overall pattern of a distribution by its **shape, center,** and **spread**. An important kind of departure is an **outlier**, an individual value that falls outside the overall pattern. **SOCS** is the acronym to remember how to describe the pattern.

When describing the shape of a distribution, focus on the main features such as major peaks, clusters and gaps, potential outliers, and rough symmetry and clear skewness.

A distribution is roughly symmetric if the right and left sides of the graph are approximately mirror images of each other

A distribution is skewed to the right if the right side of the graph is much longer than the left side. It’s skewed to the left if the left side of the graph is much longer than the right side.

Whether they are skewed left or right, most graphs are unimodal, meaning they have a single peak. A graph with two clear peaks is bimodal, and with more than that, it’s multimodal.

You should always discuss shape, center, spread, and possible outliers whenever comparing distributions of a quantitative variable.

Make sure to explicitly compare the distributions of the *samples*, rather than just describing each one separately.

Another common graph is a stemplot (also called a stem-and-leaf plot). Stemplots give a quick picture of the shape of a distribution while also including the actual numerical data.

To better view and compare data, there are a few different ways to arrange stemplots.

Splitting stems involves placing leaves 0-4 on one stem and 5-9 on another stem to view the spread of the data more easily

Back-to-back stemplots help to compare data by sharing stems and having data on either side

Stemplots don’t work well for large data sets where each stem has to have a lot of leaves

Too few or too many stems will make it hard to see the distribution’s shape

Rounding data is sometimes necessary to avoid having too many digits (Ex: a salary of $42,549 rounds to $43,000, written with a stem of 4 and a leaf of 3)
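As an illustration (not from the book), a bare-bones stemplot for two-digit values can be built in a few lines of Python:

```python
from collections import defaultdict

def stemplot(values):
    """Split each value into a stem (tens digit) and a leaf (ones digit)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(sorted(stems.items()))

data = [42, 45, 47, 51, 53, 53, 60]
plot = stemplot(data)
for stem, leaves in plot.items():
    print(f"{stem} | {''.join(map(str, leaves))}")
# 4 | 257
# 5 | 133
# 6 | 0
```

Rotate the display 90° counterclockwise and the shape of the distribution appears, just as in a histogram.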

A graph of the distribution can sometimes be clearer if nearby values are grouped together. A histogram is a graph that shows the counts (or percents) of values falling into equal-width classes.

Don’t confuse histograms and bar graphs

Use percents instead of counts on the vertical axis when comparing distributions with different numbers of observations

Just because a graph looks nice doesn’t make it a meaningful display of data

The most common measure of center is the ordinary arithmetic average, or mean.

To find the mean of a set of observations, add their values and divide by the number of observations

The notation “x-bar” is commonly used to refer to the mean, but it only refers to the mean of a *sample*.

The mean’s weakness as a measure of center is that it’s sensitive to the influence of extreme observations. In other words, in a skewed data set, the outliers pull the mean towards the tail. Because of this, the mean is *not* a resistant measure of center.

The median is the midpoint of a distribution, the number such that about half the observations are smaller and about half are larger.

To find the median of a distribution:

Arrange all observations in order of size, from smallest to largest

If the number of observations *n* is odd, the median is the center observation in the ordered list.

If the number of observations *n* is even, the median is the average of the two center observations in the ordered list.

The median, unlike the mean, is a resistant measure of center. An outlier counts as just one observation above the center, no matter how far above the center it lies. The mean uses the actual value of each observation and will chase a single large observation upward.
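A quick Python check of this resistance, using a made-up data set:

```python
from statistics import mean, median

data = [2, 3, 3, 4, 5]
with_outlier = data + [50]  # add a single extreme observation

print(mean(data), median(data))                  # 3.4  3
print(mean(with_outlier), median(with_outlier))  # about 11.17  3.5
```

One extreme value more than triples the mean but barely moves the median.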

A useful numerical description of a distribution requires both a measure of center and a measure of spread. The simplest measure of variability is the range.

To compute the range, subtract the smallest value from the largest value.

To calculate the quartiles:

Arrange the observations in increasing order and locate the median in the ordered list of observations

The first quartile Q1 is the median of the observations that are to the left of the median in the ordered list

The third quartile Q3 is the median of the observations that are to the right of the median in the ordered list.

The interquartile range (IQR) is defined as

IQR = Q3 - Q1

In addition to serving as a measure of spread, the interquartile range (IQR) is used in the 1.5 x IQR rule for identifying outliers

An observation is an outlier if it falls more than 1.5 x IQR above the third quartile or below the first quartile
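A Python sketch of the quartile steps and the 1.5 × IQR rule, with made-up data (note that statistical software often computes quartiles by slightly different conventions):

```python
def quartiles(values):
    """Q1 and Q3 per the steps above: medians of the halves below/above the overall median."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # observations left of the median
    upper = ordered[half + n % 2:]    # observations right of the median (skip median if n odd)

    def med(xs):
        m = len(xs) // 2
        return xs[m] if len(xs) % 2 else (xs[m - 1] + xs[m]) / 2

    return med(lower), med(upper)

data = [5, 7, 10, 14, 18, 19, 25, 29, 31, 33, 80]
q1, q3 = quartiles(data)     # 10, 31
iqr = q3 - q1                # 21
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print(q1, q3, iqr, outliers)  # 10 31 21 [80]
```

Here the fences are 10 − 31.5 = −21.5 and 31 + 31.5 = 62.5, so only 80 is flagged as an outlier.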

The five-number summary of a distribution consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation, written in order from smallest to largest:

Minimum, Q1, median, Q3, maximum

The five-number summary leads to a new graph, the boxplot (also known as a box-and-whisker plot).

A central box is drawn from Q1 to Q3

A line in the box marks the median

Lines (called whiskers) extend from the box out to the smallest and largest observations that are not outliers

Outliers are marked with a special symbol such as an asterisk (*)

The most common numerical description of a distribution is the combination of the mean to measure center and the standard deviation to measure spread. The standard deviation and the variance measure spread by looking at how far the observations are from their mean.

The standard deviation Sx measures the typical distance of the values in a distribution from the mean. It’s calculated by finding an average of the squared deviations, then taking the square root. This average squared deviation, Sx², is called the variance.

To find the standard deviation of *n* observations:

Find the distance of each observation from the mean and square each of these distances

Average the squared distances by dividing their sum by n - 1

The standard deviation Sx is the square root of this average squared distance
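These steps can be sketched in Python (this mirrors the n − 1 “sample” definition above):

```python
from math import sqrt

def sample_std(values):
    """Sample standard deviation per the steps above (divide by n - 1)."""
    n = len(values)
    xbar = sum(values) / n
    squared_deviations = [(x - xbar) ** 2 for x in values]
    variance = sum(squared_deviations) / (n - 1)  # the sample variance, Sx^2
    return sqrt(variance)

data = [1, 3, 4, 4, 8]
print(sample_std(data))  # sqrt(26/4), about 2.55
```

Python’s built-in `statistics.stdev` uses this same n − 1 divisor; `statistics.pstdev` divides by n instead.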

When choosing measures of spread, it’s important to pay attention to outliers and the skewness of the distribution. Because the mean and standard deviation are sensitive to extreme observations, they can be misleading when a distribution is strongly skewed or has outliers. In these cases, the median and IQR, which are both resistant to extreme values, provide a better summary.

How to organize a statistics problem: A four-step process

**State:** What’s the question you’re trying to answer?

**Plan:** How will you go about answering the question? What statistical technique does this problem call for?

**Do:** Make graphs and carry out needed calculations.

**Conclude:** Give your conclusion in the setting of the real-world problem.
