14.7 Intro to Statistical Thinking

In statistical thinking, we make judgments about an entire population based on a sample.

A random sample is one that is selected arbitrarily and without bias.
- The importance of randomness cannot be overemphasized.
- The larger the sample size, the more closely the sample properties approximate those of the population.
Non-random samples are generated when there is bias in the selection process.
- Non-random samples are useless for drawing conclusions because the sample is not representative of the population.
A simple random sample is one in which every individual in the population has the same probability of being selected.
- To satisfy this requirement, the sampling method that is used must be free of bias with respect to the property being measured.
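
As a rough illustration, a simple random sample can be drawn with Python's standard `random.sample`, which selects without replacement so that every individual has the same chance of being chosen. The population of 1,000 numbered individuals, the sample size of 50, and the fixed seed are all assumptions made up for this sketch.

```python
import random

# Hypothetical population: individuals identified by the numbers 1..1000.
population = list(range(1, 1001))

# Fixed seed only so that the example is reproducible.
rng = random.Random(42)

# Draw a simple random sample of 50 individuals, without replacement.
sample = rng.sample(population, 50)

# Compare a sample property (the mean) with the population property.
population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

print(f"population mean: {population_mean}")
print(f"sample mean:     {sample_mean:.1f}")
```

Rerunning with a larger sample size should, in most runs, bring the sample mean closer to the population mean.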

Common types of sampling bias:
- Undercoverage bias (or exclusion bias), in which part of the population is excluded from the sampling process.
- Response bias, in which the wording of a questionnaire is not neutral but rather suggests or provokes a particular response.
- Nonresponse bias, in which individuals with a common characteristic are unwilling (or neglect) to respond to a questionnaire. (Notice that this is not the opposite of response bias.)
- Self-selection bias (or voluntary response bias), in which individuals select themselves (or volunteer) for the sample.

Convenience sampling is sampling in which individuals are chosen only because they are nearby or easily accessible.

%%Design of Experiments%%

Extraneous or unintended variables that systematically affect the property being studied are called confounding variables.
- Such variables are said to confound (or mix up) the results of the study.
- To eliminate or vastly reduce the effects of confounding variables, researchers often conduct experiments so that such variables can be controlled.
In an experimental study, two groups are selected: a treatment group (in which individuals are given a treatment) and a control group (in which individuals are not given the treatment).
- The individuals in the experiment are called subjects (or experimental units).
- The goal is to measure the response of the subjects to the treatment.
A common confounding factor is the placebo effect, in which patients who think they are receiving a medication report an improvement (perceived or actual).
- This happens even though the “treatment” they received was a placebo, that is, a simulated or false treatment (sometimes called a “sugar pill”).

Statistical conclusions are based on probability and are always accompanied by a confidence level.
- The 95% confidence level means that there is less than a 5% chance (or 0.05 probability) that the result obtained from the sample could be obtained by chance alone.
- In the popular press, poll results are accompanied by a margin of error.
At the 95% confidence level, the margin of error $d$ and the sample size $n$ are related by the formula $d = \frac{1.96}{2\sqrt{n}}$.
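
A short sketch of the margin-of-error formula in both directions, reading it as d = 1.96 / (2√n). The function names `margin_of_error` and `sample_size_for` are hypothetical, chosen only for this example.

```python
import math

def margin_of_error(n):
    """Margin of error d at the 95% confidence level for sample size n."""
    return 1.96 / (2 * math.sqrt(n))

def sample_size_for(d):
    """Smallest sample size n whose margin of error is at most d.

    Solving d = 1.96 / (2 * sqrt(n)) for n gives n = (1.96 / (2 * d)) ** 2,
    rounded up to the next whole individual.
    """
    return math.ceil((1.96 / (2 * d)) ** 2)

# A poll of 1000 people has a margin of error of about 3.1%.
print(round(margin_of_error(1000), 3))  # 0.031

# A margin of error of 3% needs at least 1068 people.
print(sample_size_for(0.03))            # 1068
```

Because n appears under a square root, halving the margin of error requires roughly four times as many individuals.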

Two-variable data measures two properties of each individual.
- Two-variable data can be graphed in a coordinate plane, resulting in a scatter plot.
Analyzing such data mathematically by finding the line that best fits the data is called finding the regression line.
Associated with the regression line is a correlation coefficient, which is a mathematical measure of how well the data fit along the regression line, or how well the two variables are correlated.
- The correlation coefficient r is between -1 and 1. If r is close to zero, the variables have little correlation.
- The closer r is to 1 or -1, the closer the data points are to the regression line.
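
The regression line and correlation coefficient can be computed directly from the data. The sketch below assumes the usual least-squares line and the Pearson correlation coefficient, which these notes do not spell out; the data (hours studied versus test score) are made up for illustration, and the function name is hypothetical.

```python
import math

def regression_and_correlation(xs, ys):
    """Return (slope, intercept, r) for paired data xs, ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of squared deviations and of cross products.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx                    # least-squares slope
    intercept = mean_y - slope * mean_x  # line passes through the means
    r = sxy / math.sqrt(sxx * syy)       # Pearson correlation coefficient
    return slope, intercept, r

# Made-up two-variable data: hours studied and test score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]

slope, intercept, r = regression_and_correlation(hours, scores)
print(f"regression line: y = {slope:.2f}x + {intercept:.2f}")
print(f"correlation coefficient: r = {r:.3f}")
```

Here r comes out close to 1, which matches the points lying nearly on a rising line.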
In statistics, a question of interest when studying two-variable data is whether or not the correlation is statistically significant.
That is, what is the probability that the correlation in the sample is due to chance alone?
- If the sample consists of only three individuals, even a strong correlation coefficient may not be significant.
- For a large sample, a small correlation coefficient may be significant.
- This is because, if there is no correlation at all in the population, it is very unlikely that a large random sample would produce data with a linear trend, whereas a small sample is more likely to produce correlated data by chance alone (this is why a large sample size is so important).
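
The effect of sample size can be checked with a small simulation. The sketch below repeatedly draws two independent (and therefore uncorrelated) variables and counts how often the sample still shows a strong correlation. The threshold of 0.8, the 2000 trials, the normal distribution, and the sample sizes 3 and 50 are all assumptions chosen for illustration.

```python
import math
import random

def correlation(xs, ys):
    """Pearson correlation coefficient of paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

def fraction_strong_by_chance(n, trials=2000, threshold=0.8, seed=0):
    """Fraction of samples of size n, drawn from two independent
    variables, whose correlation satisfies |r| >= threshold."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]
        if abs(correlation(xs, ys)) >= threshold:
            hits += 1
    return hits / trials

small = fraction_strong_by_chance(3)
large = fraction_strong_by_chance(50)
print(f"n = 3:  fraction with |r| >= 0.8 is {small:.3f}")
print(f"n = 50: fraction with |r| >= 0.8 is {large:.3f}")
```

With three individuals, a strong correlation appears by chance in a large share of the trials; with fifty, it essentially never does.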

Correlation does not imply causation: a strong correlation between two variables does not by itself show that one causes the other.