14.7 Intro to Statistical Thinking
In statistical thinking, we make judgments about an entire population based on a sample.
- A population consists of all the individuals in a group.
- A sample is any subset of the population.
- In survey sampling, data are collected through responses to questionnaires.
- In observational studies, data are collected by observation.
- In experimental studies, data are collected by performing experiments.
%%The Key Role of Randomness%%
- A random sample is one that is selected arbitrarily and without bias.
* The importance of randomness cannot be overemphasized.
* The larger the sample size, the more closely the sample properties approximate those of the population.
- Non-random samples are generated when there is bias in the selection process.
* Non-random samples are useless because the sample is not representative of the population.
- A simple random sample is one in which every individual in the population has the same probability of being selected.
* To satisfy this requirement, the sampling method that is used must be free of bias with respect to the property being measured.
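As a rough illustration of the points above, the sketch below draws simple random samples of increasing size and compares each sample mean with the population mean. The population of heights, the function name simple_random_sample, and the seeds are all made up for this example; it is only a minimal sketch, not a prescribed sampling procedure.

```python
import random
import statistics


def simple_random_sample(population, n, seed=None):
    """Draw n distinct individuals; each one is equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(population, n)


# Hypothetical population: 10,000 made-up heights in centimeters.
_rng = random.Random(0)
population = [_rng.gauss(170, 10) for _ in range(10_000)]
population_mean = statistics.mean(population)

# Compare the sample mean with the population mean as n grows.
for n in (10, 100, 1000):
    sample = simple_random_sample(population, n, seed=1)
    gap = abs(statistics.mean(sample) - population_mean)
    print(f"n={n:5d}  sample mean off by {gap:.2f} cm")
```

Larger samples tend to land closer to the population mean, although any single run can deviate from that trend by chance.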
Common Types of Sampling Bias
- Undercoverage bias (or exclusion bias), in which part of the population is excluded from the sampling process
- Response bias, in which the wording of a questionnaire is not neutral but rather suggests or provokes a particular response
- Nonresponse bias, in which individuals with a common characteristic are unwilling (or neglect) to respond to a questionnaire. (Notice that this is not the opposite of response bias)
- Self-selection bias (or voluntary response bias), in which individuals select themselves (or volunteer) for the sample
- Convenience sampling, in which individuals are sampled only because they are nearby or easily accessible
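To make one of these biases concrete, here is a small simulation of undercoverage bias. Everything in it is hypothetical: the population of 10,000 voters, the split into "reachable" and "unreachable" individuals, and the support rates were chosen only so that the excluded group differs from the included one.

```python
import random

# Hypothetical population of 10,000 voters, stored as (reachable, supports).
# 4,000 are reachable by the survey method and 70% of them support a measure;
# 6,000 are not reachable and only 40% of them support it.
population = (
    [(True, True)] * 2800 + [(True, False)] * 1200
    + [(False, True)] * 2400 + [(False, False)] * 3600
)


def support_rate(people):
    """Fraction of the given individuals who support the measure."""
    return sum(1 for _, supports in people if supports) / len(people)


rng = random.Random(0)
true_rate = support_rate(population)  # 0.52 by construction

# Simple random sample drawn from the whole population.
fair_sample = rng.sample(population, 500)

# Undercoverage: unreachable individuals can never be selected.
reachable_only = [p for p in population if p[0]]
biased_sample = rng.sample(reachable_only, 500)

print("population:", true_rate)
print("simple random sample:", support_rate(fair_sample))
print("reachable-only sample:", support_rate(biased_sample))
```

Increasing the size of the reachable-only sample does not help: it converges to the support rate of the reachable group (70% here), not to that of the population.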
%%Design of Experiments%%
Extraneous or unintended variables that systematically affect the property being studied are called confounding variables.
* Such variables are said to confound (or mix up) the results of the study.
* To eliminate or vastly reduce the effects of confounding variables, researchers often conduct experiments so that such variables can be controlled.
In an experimental study, two groups are selected: a treatment group (in which individuals are given a treatment) and a control group (in which individuals are not given the treatment).
* The individuals in the experiment are called subjects (or experimental units).
* The goal is to measure the response of the subjects to the treatment.
A common confounding factor is the placebo effect, in which patients who think they are receiving a medication report an improvement (perceived or actual) even though the “treatment” they received was a placebo: a simulated or false treatment (sometimes called a “sugar pill”).
%%Sample Size and Margin of Error%%
- Statistical conclusions are based on probability and are always accompanied by a confidence level.
* The 95% confidence level means that there is less than a 5% chance (or 0.05 probability) that the result obtained from the sample could be obtained by chance alone.
* In the popular press, poll results are accompanied by a margin of error.
- At the 95% confidence level, the margin of error d and the sample size n are related by the formula d = 1.96 / (2√n).
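The margin-of-error formula can be evaluated directly, and it can also be solved for n to estimate how large a sample is needed for a target margin. The function names below (margin_of_error, sample_size_for) are illustrative choices, not part of the original notes.

```python
import math

Z_95 = 1.96  # multiplier used at the 95% confidence level


def margin_of_error(n):
    """Margin of error d = 1.96 / (2 * sqrt(n)) for a sample of size n."""
    return Z_95 / (2 * math.sqrt(n))


def sample_size_for(d):
    """Smallest whole n whose margin of error is at most d.

    Obtained by solving d = 1.96 / (2 * sqrt(n)) for n and rounding up.
    """
    return math.ceil((Z_95 / (2 * d)) ** 2)


# A poll of 1,000 people has a margin of error of about 0.031 (3.1%).
print(round(margin_of_error(1000), 3))

# A 5% margin of error needs at least 385 respondents.
print(sample_size_for(0.05))
```

Because d shrinks like 1/√n, halving the margin of error requires roughly four times as many respondents.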
%%Two-Variable Data and Correlation%%
Two-variable data measures two properties of each individual.
* Two-variable data can be graphed in a coordinate plane, resulting in a scatter plot.
Analyzing such data mathematically by finding the line that best fits the data is called finding the regression line.
Associated with the regression line is a correlation coefficient, which is a mathematical measure of how well the data fit along the regression line, or how well the two variables are correlated.
* The correlation coefficient r is between -1 and 1. If r is close to zero, the variables have little correlation.
* The closer r is to 1 or -1, the closer the data points are to the regression line.
In statistics, a question of interest when studying two-variable data is whether or not the correlation is statistically significant; that is, what is the probability that the correlation in the sample is due to chance alone?
* If the sample consists of only three individuals, even a strong correlation coefficient may not be significant.
* For a large sample, a small correlation coefficient may be significant.
* This is because if there is no correlation at all in the population, it is very unlikely that a large random sample would produce data that have a linear trend, whereas a small sample is more likely to produce correlated data by chance alone (this is why a large sample size is very important).
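The sketch below computes a least-squares regression line and the correlation coefficient r, then estimates by simulation how often two unrelated variables show a strong correlation purely by chance. The helper names, the 0.8 threshold for "strong," the number of trials, and the use of uniform random data are assumptions made for illustration; this is an informal simulation, not a formal significance test.

```python
import math
import random


def _sums(xs, ys):
    """Return means and the centered sums used by both helpers below."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return mx, my, sxy, sxx, syy


def regression_line(xs, ys):
    """Slope and intercept of the least-squares line y = slope * x + intercept."""
    mx, my, sxy, sxx, _ = _sums(xs, ys)
    slope = sxy / sxx
    return slope, my - slope * mx


def correlation(xs, ys):
    """Correlation coefficient r, always between -1 and 1."""
    _, _, sxy, sxx, syy = _sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)


def chance_of_strong_r(n, threshold=0.8, trials=2000, seed=0):
    """Fraction of trials in which two independent variables give |r| >= threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.random() for _ in range(n)]
        ys = [rng.random() for _ in range(n)]  # generated independently of xs
        if abs(correlation(xs, ys)) >= threshold:
            hits += 1
    return hits / trials


for n in (3, 10, 100):
    print(f"n={n:3d}  share of trials with |r| >= 0.8: {chance_of_strong_r(n):.3f}")
```

With only three individuals, a strong r turns up in a large share of the trials even though the variables are unrelated; with 100 individuals it essentially never does.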
Correlation does not necessarily imply causation.