In statistical thinking, we make judgments about an entire population based on a sample.
- A population consists of all the individuals in a group.
- A sample is any subset of the population.
- In survey sampling, data are collected through responses to questionnaires.
- In observational studies, data are collected by observation.
- In experimental studies, data are collected by performing experiments.
%%The Key Role of Randomness%%
- A random sample is one that is selected arbitrarily and without bias.
- The importance of randomness cannot be overemphasized.
- The larger the sample size, the more closely the sample properties approximate those of the population.
- Non-random samples are generated when there is bias in the selection process.
- Non-random samples are unreliable for drawing conclusions, because the sample may not be representative of the population.
- A simple random sample is one in which every individual in the population has the same probability of being selected.
- To satisfy this requirement, the sampling method that is used must be free of bias with respect to the property being measured.
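The effect of sample size on a simple random sample can be sketched with a short simulation. This is only an illustration under stated assumptions: the population is made up (10,000 simulated measurements), the helper name `average_error` is hypothetical, and the seed is fixed only so the run is repeatable.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed, only for reproducibility of the sketch

# Hypothetical population: 10,000 made-up measurements (not real data).
population = [rng.gauss(170, 10) for _ in range(10_000)]
pop_mean = statistics.fmean(population)

def average_error(n, trials=200):
    """Average gap between the sample mean and the population mean
    over many simple random samples of size n."""
    errors = []
    for _ in range(trials):
        # random.sample draws without replacement; every individual
        # has the same probability of being selected.
        sample = rng.sample(population, n)
        errors.append(abs(statistics.fmean(sample) - pop_mean))
    return statistics.fmean(errors)

error_small = average_error(10)
error_large = average_error(1000)
print(f"n=10: {error_small:.2f}   n=1000: {error_large:.2f}")
```

In this sketch the larger samples should land noticeably closer to the population mean, which is the point made in the list above.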
%%Common Types of Sampling Bias%%
- Undercoverage bias (or exclusion bias), in which part of the population is excluded from the sampling process.
- Response bias, in which the wording of a questionnaire is not neutral but rather suggests or provokes a particular response.
- Nonresponse bias, in which individuals with a common characteristic are unwilling (or neglect) to respond to a questionnaire. (Notice that this is not the opposite of response bias.)
- Self-selection bias (or voluntary response bias), in which individuals select themselves (or volunteer) for the sample.
- Convenience sampling, in which individuals are sampled only because they are nearby or easily accessible.
%%Design of Experiments%%
Extraneous or unintended variables that systematically affect the property being studied are called confounding variables.
- Such variables are said to confound (or mix up) the results of the study.
- To eliminate or vastly reduce the effects of confounding variables, researchers often conduct experiments so that such variables can be controlled.
In an experimental study, two groups are selected: a treatment group (in which individuals are given a treatment) and a control group (in which individuals are not given the treatment).
- The individuals in the experiment are called subjects (or experimental units).
- The goal is to measure the response of the subjects to the treatment.
A common confounding factor is the placebo effect, in which patients who think they are receiving a medication report an improvement (perceived or actual) even though the “treatment” they received was a placebo, that is, a simulated or false treatment (sometimes called a “sugar pill”).
%%Sample Size and Margin of Error%%
- Statistical conclusions are based on probability and are always accompanied by a confidence level.
- A 95% confidence level means that there is less than a 5% chance (a probability of 0.05) that the result obtained from the sample is due to chance alone.
- In the popular press, poll results are accompanied by a margin of error.
- At the 95% confidence level, the margin of error d and the sample size n are related by the formula d = 1.96/(2√n).
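As a quick check of the margin-of-error formula, the sketch below evaluates d = 1.96/(2√n) for a given sample size and also solves it for n. The function names are hypothetical, and rounding the sample size up to a whole number is an assumption of the sketch.

```python
import math

def margin_of_error(n):
    """Margin of error d at the 95% confidence level for sample size n."""
    return 1.96 / (2 * math.sqrt(n))

def sample_size_for(d):
    """Smallest whole sample size n whose margin of error is at most d
    (the formula above solved for n, then rounded up)."""
    return math.ceil((1.96 / (2 * d)) ** 2)

print(round(margin_of_error(1000), 3))  # 0.031, about 3.1 percentage points
print(sample_size_for(0.03))            # 1068
```

Because d shrinks like 1/√n, halving the margin of error takes roughly four times as many respondents.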
%%Two-Variable Data and Correlation%%
Two-variable data measure two properties of each individual.
- Two-variable data can be graphed in a coordinate plane, resulting in a scatter plot.
Analyzing such data mathematically by finding the line that best fits the data is called finding the regression line.
Associated with the regression line is a correlation coefficient, which is a mathematical measure of how well the data fit along the regression line, or how well the two variables are correlated.
- The correlation coefficient r is between -1 and 1. If r is close to zero, the variables have little correlation.
- The closer r is to 1 or -1, the closer the data points are to the regression line.
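The regression line and the correlation coefficient r can be computed directly from the data. The sketch below assumes the usual least-squares definition of "best fit"; the function name and the five data points are made up for illustration.

```python
import math

def regression_and_r(xs, ys):
    """Return (slope, intercept, r) for the least-squares line y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # always between -1 and 1
    return slope, intercept, r

# Hypothetical two-variable data (not from any real study).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept, r = regression_and_r(xs, ys)
print(f"y = {slope:.2f}x + {intercept:.2f}, r = {r:.3f}")
```

For these points the line is y = 0.60x + 2.20 with r of about 0.775, a fairly strong positive correlation. Points that lie exactly on a rising line would give r = 1.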
In statistics, a question of interest when studying two-variable data is whether or not the correlation is statistically significant. That is, what is the probability that the correlation in the sample is due to chance alone?
- If the sample consists of only three individuals, even a strong correlation coefficient may not be significant.
- For a large sample, a small correlation coefficient may be significant.
- This is because, if there is no correlation at all in the population, it is very unlikely that a large random sample would produce data that have a linear trend, whereas a small sample is more likely to produce correlated data by chance alone. This is why a large sample size is very important.
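The role of sample size can be illustrated with a small simulation. The sketch below draws two variables that are independent by construction (so the population has no correlation) and counts how often a sample still shows a strong correlation. The threshold of 0.8, the trial count, and the function names are assumptions chosen for illustration.

```python
import math
import random

def correlation(xs, ys):
    """Correlation coefficient r of paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

def fraction_strong(n, trials, rng, threshold=0.8):
    """Fraction of samples of size n with |r| >= threshold, when the
    two variables are generated independently of each other."""
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]
        if abs(correlation(xs, ys)) >= threshold:
            hits += 1
    return hits / trials

rng = random.Random(1)  # fixed seed, only for reproducibility
frac_n3 = fraction_strong(3, 2000, rng)
frac_n100 = fraction_strong(100, 2000, rng)
print(f"n=3: {frac_n3:.2f}   n=100: {frac_n100:.2f}")
```

With only three individuals, a large share of the samples should show a strong correlation by chance alone, while with 100 individuals this should essentially never happen.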
Correlation does not necessarily imply causation.