14.7 Intro to Statistical Thinking

In **statistical thinking**, we make judgments about an entire population based on a sample.

A **population** consists of *all* the individuals in a group.

A **sample** is any subset of the population.

In **survey sampling**, data are collected through responses to questionnaires.

In **observational studies**, data are collected by observation.

In **experimental studies**, data are collected by performing experiments.

A **random sample** is one that is selected arbitrarily and without bias. The importance of randomness cannot be overemphasized.

The larger the sample size, the more closely the sample properties approximate those of the population.

**Non-random samples** are generated when there is bias in the selection process. Non-random samples are useless for drawing conclusions about the population because the sample is not representative of it.

A **simple random sample** is one in which every individual in the population has the same probability of being selected. To satisfy this requirement, the sampling method that is used must be free of bias with respect to the property being measured.
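As a rough illustration of these ideas, the sketch below draws simple random samples with Python's standard `random` module and compares how far the sample mean typically lands from the population mean for a small and a large sample. The population (simulated heights), the helper names `simple_random_sample` and `mean_abs_error`, and the sample sizes are all assumptions made up for this example.

```python
import random
import statistics

def simple_random_sample(population, n, rng):
    """Draw n distinct individuals; each individual is equally likely to be chosen."""
    return rng.sample(population, n)

def mean_abs_error(population, n, trials, rng):
    """Average distance between the sample mean and the population mean over many samples."""
    true_mean = statistics.fmean(population)
    errors = [
        abs(statistics.fmean(simple_random_sample(population, n, rng)) - true_mean)
        for _ in range(trials)
    ]
    return statistics.fmean(errors)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 simulated heights in cm (made-up data).
population = [rng.gauss(170, 10) for _ in range(10_000)]

small_error = mean_abs_error(population, 10, 200, rng)
large_error = mean_abs_error(population, 1000, 200, rng)
print(f"n = 10:   typical error {small_error:.2f} cm")
print(f"n = 1000: typical error {large_error:.2f} cm")
```

The larger samples should land noticeably closer to the population mean on average, which is the sense in which sample properties approximate those of the population as the sample size grows.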

Common types of bias in sampling include:

- **Undercoverage bias** (or **exclusion bias**), in which part of the population is excluded from the sampling process.
- **Response bias**, in which the wording of a questionnaire is not neutral but rather suggests or provokes a particular response.
- **Nonresponse bias**, in which individuals with a common characteristic are unwilling (or neglect) to respond to a questionnaire. (Notice that this is not the opposite of response bias.)
- **Self-selection bias** (or **voluntary response bias**), in which individuals select themselves (or volunteer) for the sample.

**Convenience sampling** is sampling in which individuals are chosen only because they are nearby or easily accessible.

Extraneous or unintended variables that systematically affect the property being studied are called **confounding variables**. Such variables are said to **confound** (or mix up) the results of the study. To eliminate or vastly reduce the effects of confounding variables, researchers often conduct *experiments* so that such variables can be *controlled*.

In an **experimental study**, two groups are selected: a **treatment group** (in which individuals are given a treatment) and a **control group** (in which individuals are not given the treatment). The individuals in the experiment are called **subjects** (or **experimental units**). The goal is to measure the **response** of the subjects to the treatment.

A common confounding factor is the **placebo effect**, in which patients who *think* they are receiving a medication report an improvement (perceived or actual) even though the “treatment” they received was a **placebo**, that is, a simulated or false treatment (sometimes called a “sugar pill”).

Statistical conclusions are based on probability and are always accompanied by a confidence level.

The 95% **confidence level** means that there is less than a 5% chance (or 0.05 probability) that the result obtained from the sample could be obtained by chance alone. In the popular press, poll results are accompanied by a *margin of error*.

At the 95% confidence level, the margin of error *d* and the sample size *n* are related by the formula

$$d = \frac{1.96}{2\sqrt{n}}$$
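To make the formula concrete, here is a minimal sketch that evaluates the margin of error for a few sample sizes and solves the same formula for *n* to find the sample size needed for a target margin. The function names `margin_of_error` and `sample_size_for` are illustrative choices, not part of the original notes.

```python
import math

Z_95 = 1.96  # constant from the formula for the 95% confidence level

def margin_of_error(n):
    """Margin of error d = 1.96 / (2 * sqrt(n)) for a sample of size n."""
    return Z_95 / (2 * math.sqrt(n))

def sample_size_for(d):
    """Smallest n whose margin of error is at most d (the formula solved for n)."""
    return math.ceil((Z_95 / (2 * d)) ** 2)

for n in (100, 400, 1000):
    print(f"n = {n:4d}: d = {margin_of_error(n):.3f}")

# Sample size needed for a margin of error of 3 percentage points.
print(sample_size_for(0.03))  # 1068
```

Because *d* is proportional to 1/√*n*, quadrupling the sample size only halves the margin of error: going from 100 to 400 respondents takes *d* from 0.098 to 0.049.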

**Two-variable data** measure two properties of each individual. Two-variable data can be graphed in a coordinate plane, resulting in a **scatter plot**.

Analyzing such data mathematically by finding the line that best fits the data is called finding the **regression line**. Associated with the regression line is a **correlation coefficient**, which is a mathematical measure of how well the data fit along the regression line, or how well the two variables are **correlated**. The correlation coefficient *r* is between -1 and 1. If *r* is close to zero, the variables have little correlation. The closer *r* is to 1 or -1, the closer the data points are to the regression line.
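The notes do not say how the regression line or *r* is computed. One common approach, assumed here, is the least-squares line together with the Pearson correlation coefficient. The sketch below implements both with plain Python; the data set (hours studied versus test score) is made up for illustration.

```python
import math

def regression_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def correlation(xs, ys):
    """Pearson correlation coefficient r, always between -1 and 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

# Made-up two-variable data: hours studied and test score for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]

slope, intercept = regression_line(hours, scores)
r = correlation(hours, scores)
print(f"regression line: y = {slope:.1f}x + {intercept:.1f}")  # y = 6.0x + 46.0
print(f"r = {r:.3f}")
```

Here *r* comes out a little above 0.98, so the points lie close to the regression line.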

In statistics, a question of interest when studying two-variable data is whether or not the correlation is statistically significant: __what is the probability that the correlation in the sample is due to chance alone?__

If the sample consists of only three individuals, even a strong correlation coefficient may not be significant. For a large sample, a small correlation coefficient may be significant.

This is because if there is no correlation at all in the population, it’s very unlikely that a large random sample would produce data that have a linear trend, whereas a small sample is more likely to produce correlated data by chance alone (this is why a large sample size is very important).
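The effect of sample size on chance correlations can be checked with a small simulation. The sketch below repeatedly draws two *independent* variables, so the population has no correlation at all, and counts how often the sample still shows a strong correlation. The threshold of 0.8, the trial count, and the helper name `chance_of_strong_r` are arbitrary choices for this illustration.

```python
import math
import random

def correlation(xs, ys):
    """Pearson correlation coefficient r of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

def chance_of_strong_r(n, threshold, trials, rng):
    """Fraction of size-n samples of UNCORRELATED variables with |r| >= threshold."""
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]  # generated independently of xs
        if abs(correlation(xs, ys)) >= threshold:
            hits += 1
    return hits / trials

rng = random.Random(1)  # fixed seed so the run is repeatable
small = chance_of_strong_r(3, 0.8, 2000, rng)
large = chance_of_strong_r(50, 0.8, 2000, rng)
print(f"n = 3:  |r| >= 0.8 in {small:.1%} of samples")
print(f"n = 50: |r| >= 0.8 in {large:.1%} of samples")
```

With three individuals, a sizable share of samples look strongly correlated purely by chance, while with fifty it almost never happens. That is why a strong *r* from a tiny sample may not be significant.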

__Correlation is not always causation__.
