# Chapter 1: Sampling and Data

## 1.1 Definitions of Statistics, Probability, and Key Terms

**Statistics**: the science of planning studies and experiments; obtaining data; and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions based on the data.

**Data**: collections of observations.

**Descriptive Statistics**: organizing and summarizing data, both by graphing and by numerical values (such as an average).

**Inferential Statistics**: methods that take a result from a sample, extend it to the population, and measure the reliability of the result.

**Probability**: the chance of an event occurring.

**Population**: the complete collection of all individuals to be studied.

**Sample**: a subcollection of members selected from a population.

**Sampling**: selecting a portion (or subset) of the larger population and studying that portion (the sample) to gain information about the population. Data are the result of sampling from a population.

**Parameter**: a numerical measurement describing some characteristic of a population.

**Statistic**: a numerical measurement describing some characteristic of a sample.

**Representative Sample**: a sample that contains the characteristics of the population. One of the main concerns in the field of statistics is how accurately a statistic estimates a parameter.

**Variable**: a characteristic or measurement that can be determined for each member of a population.

**Mean**: the "average"; the sum of the values divided by the number of values.

**Proportion**: the part out of the whole (total).
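The distinction between a parameter and a statistic can be sketched in a few lines of Python. The population below (commute times for 1,000 employees) is invented purely for illustration, and the fixed seed is an assumption made so the run is reproducible. The mean and proportion computed from the full list are parameters; the same quantities computed from a sample of 50 are statistics that estimate them.

```python
import random
from statistics import mean

# Hypothetical population: commute times in minutes for 1,000 employees.
rng = random.Random(42)  # fixed seed (an assumption) so results are reproducible
population = [rng.randint(5, 90) for _ in range(1000)]

# Parameters: numerical measurements describing the whole population.
population_mean = mean(population)
population_proportion = sum(t > 60 for t in population) / len(population)

# Statistics: the same measurements computed from a sample of the population.
sample = rng.sample(population, 50)
sample_mean = mean(sample)
sample_proportion = sum(t > 60 for t in sample) / len(sample)

print(f"mean:       parameter = {population_mean:.2f}, statistic = {sample_mean:.2f}")
print(f"proportion: parameter = {population_proportion:.3f}, statistic = {sample_proportion:.3f}")
```

Rerunning with a different seed gives a different sample and therefore a different statistic, while the parameter of a given population stays fixed.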

## 1.2 Data, Sampling, and Variation in Data and Sampling

**Quantitative (or numerical) data**: data that consist of numbers representing counts or measurements.

**Qualitative (or categorical) data**: data that consist of names or labels rather than numbers representing counts or measurements.

**Discrete data**: quantitative data for which the number of possible values is either finite or countable.

**Continuous data**: quantitative data for which there are infinitely many possible values, corresponding to a continuous scale that covers a range of values without gaps, interruptions, or jumps.

**Pie chart**: categories of data are represented by wedges in a circle, with each wedge proportional in size to the percentage of individuals in that category.

**Bar graph**: the length of the bar for each category is proportional to the number or percent of individuals in that category.

**Pareto chart**: a bar graph whose bars are sorted by category size, from largest to smallest.
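As a rough sketch of how the bars of a Pareto chart are ordered, the snippet below counts a small set of qualitative data and sorts the categories from largest to smallest. The complaint categories are made up for the example; `Counter.most_common()` from the standard library returns the categories in descending order of count.

```python
from collections import Counter

# Hypothetical qualitative (categorical) data: customer complaint types.
complaints = [
    "late delivery", "damaged item", "late delivery",
    "wrong item", "late delivery", "damaged item",
]

counts = Counter(complaints)
total = sum(counts.values())

# Pareto ordering: categories sorted by size, largest first.
pareto_order = counts.most_common()

for category, count in pareto_order:
    # The percentage is what a pie-chart wedge or bar length would be proportional to.
    print(f"{category:15s} {count:2d}  {count / total:.0%}")
```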

### Sampling Methods

**Simple random sample**: a sample of n subjects selected in such a way that every possible sample of the same size n has the same chance of being chosen.

**Systematic sample**: a sample in which the researcher selects some starting point and then selects every kth element in the population.

**Stratified sample**: a sample in which the researcher subdivides the population into at least two different subgroups (or strata) and then draws a sample from each subgroup.

**Cluster sample**: a sample in which the researcher first divides the population into sections (or clusters), then randomly selects some of those clusters and includes all members of the selected clusters.

**Convenience sample**: a sample in which the researcher simply uses results that are very easy to get. This is not a valid sampling method and will likely result in biased data.

**Bias**: a sample is biased if its results are not representative of the population.
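The four probability-based methods above can be sketched in Python as follows. This is a minimal illustration, not a survey-grade implementation: the population of 100 member IDs, the two strata, the ten clusters, and all function names are hypothetical, and the stratified version assumes a simple random sample of equal size is drawn within each stratum.

```python
import random

rng = random.Random(0)            # fixed seed (an assumption) for reproducibility
population = list(range(1, 101))  # hypothetical member IDs 1..100


def simple_random_sample(pop, n, rng):
    """Every possible sample of size n is equally likely."""
    return rng.sample(pop, n)


def systematic_sample(pop, k, rng):
    """Pick a random starting point, then take every kth element."""
    start = rng.randrange(k)
    return pop[start::k]


def stratified_sample(strata, n_per_stratum, rng):
    """Draw a sample from each stratum (here, a simple random sample)."""
    chosen = []
    for members in strata.values():
        chosen.extend(rng.sample(members, n_per_stratum))
    return chosen


def cluster_sample(clusters, n_clusters, rng):
    """Randomly select whole clusters and keep all of their members."""
    picked = rng.sample(list(clusters), n_clusters)
    return [member for name in picked for member in clusters[name]]


# Hypothetical groupings of the same population.
strata = {"first_half": population[:50], "second_half": population[50:]}
clusters = {f"cluster_{i}": population[i * 10:(i + 1) * 10] for i in range(10)}

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
stratified = stratified_sample(strata, 5, rng)
clustered = cluster_sample(clusters, 2, rng)

print("simple random:", sorted(srs))
print("systematic:   ", systematic)
print("stratified:   ", sorted(stratified))
print("cluster:      ", sorted(clustered))
```

Note the difference between the last two: the stratified sample takes some members from every subgroup, while the cluster sample takes every member from only some subgroups.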

### Sources of Bias in Sampling

**Sampling bias**: the technique used to obtain the individuals in the sample tends to favor one part of the population over another.

**Nonresponse bias**: occurs when individuals selected for the sample who do not respond to a survey have different opinions from those who do.

**Response bias**: occurs when answers on a survey do not reflect the true feelings of the respondent.

**Interview error**: a trained interviewer is essential to obtain accurate information. A trained interviewer has the skill necessary to elicit responses and make the interviewee feel comfortable.

**Misrepresented answers**: some survey questions result in responses that misrepresent facts or are flat-out lies.

**Loaded questions**: the wording and presentation of a question play a large role in the type of response given. The way a question is worded can lead to response bias, so questions should always be asked in a balanced form.

**Ordering of questions/words**: questions can be unintentionally loaded by the order of the items being considered. Many surveys rearrange the order of the questions within a questionnaire so that responses are not affected by prior questions.

**Data-entry error**: although not technically a form of response bias, data-entry errors lead to results that are not representative of the population.

### Common Problems

**Problems with samples**: a sample must be representative of the population. A sample that is not representative of the population is biased.

**Self-selected samples**: responses only from people who choose to respond, such as call-in surveys, are often unreliable.

**Sample size issues**: samples that are too small may be unreliable. Larger samples are better, if possible. In some situations, small samples are unavoidable and can still be used to draw conclusions.

**Undue influence**: collecting data or asking questions in a way that influences the response.

**Non-response or refusal of participation**: the collected responses may no longer be representative of the population. Often, it is people with strong positive or negative opinions who answer surveys, which can affect the results.

**Causality**: a relationship between two variables does not mean that one causes the other to occur. They may be related (correlated) because of their relationship through a different variable.

**Misleading use of data**: improperly displayed graphs, incomplete data, or lack of context.

**Confounding**: occurs when the effects of multiple factors on a response cannot be separated.
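To make the self-selection problem concrete, here is a small sketch with invented numbers. It assumes 20 customers with satisfaction scores from 1 to 5, and assumes (for illustration only) that just the customers with the most extreme scores bother to respond. The mean of the respondents then differs noticeably from the mean of the whole population.

```python
from statistics import mean

# Hypothetical satisfaction scores (1 = very unhappy, 5 = very happy) for 20 customers.
population = [1] * 4 + [2] * 2 + [3] * 8 + [4] * 5 + [5] * 1

# Assumed response pattern: only customers with strong opinions (1 or 5) reply.
respondents = [score for score in population if score in (1, 5)]

population_mean = mean(population)    # parameter for all 20 customers
respondent_mean = mean(respondents)   # statistic from the self-selected sample

print(f"population mean: {population_mean:.2f}")
print(f"respondent mean: {respondent_mean:.2f}")
```

In this made-up case the unhappy customers dominate the responses, so the survey understates overall satisfaction even though every individual answer is truthful.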

## 1.3 Frequency, Frequency Tables, and Levels of Measurement

**Frequency**: the number of times a value of the data occurs.

**Relative frequency**: the ratio (fraction or proportion) of the number of times a value of the data occurs to the total number of outcomes.

**Cumulative relative frequency**: the sum of the relative frequency for a value and the relative frequencies of all previous values.
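The three quantities above fit together in a frequency table, which can be sketched as follows. The ten data values are hypothetical. Exact fractions are used (an implementation choice, not something the definitions require) so that the final cumulative relative frequency comes out to exactly 1 without floating-point rounding.

```python
from collections import Counter
from fractions import Fraction
from itertools import accumulate

# Hypothetical data set of 10 observations.
data = [2, 3, 3, 4, 4, 4, 5, 5, 6, 6]

counts = Counter(data)
values = sorted(counts)   # distinct data values in increasing order
n = len(data)             # total number of outcomes

frequencies = [counts[v] for v in values]
relative = [Fraction(f, n) for f in frequencies]
cumulative = list(accumulate(relative))  # running total of relative frequencies

print("value  freq  rel.freq  cum.rel.freq")
for v, f, r, c in zip(values, frequencies, relative, cumulative):
    print(f"{v:5d}  {f:4d}  {float(r):8.2f}  {float(c):12.2f}")
```

The last entry of the cumulative column is always 1, because by that point every observation has been accounted for.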

### Levels of Measurement

**Nominal scale level**: data that cannot be ordered or used in calculations.

**Ordinal scale level**: data that can be ordered, but the differences between values cannot be measured.

**Interval scale level**: data with a definite ordering but no natural starting point (zero); the differences can be measured, but ratios are not meaningful.

**Ratio scale level**: data that can be ordered and that have a natural starting point (zero); the differences have meaning and ratios can be calculated.
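One way to see why ratios are not meaningful at the interval level is to change units. The sketch below uses temperature (Celsius and Fahrenheit) as an illustrative interval scale and length (meters and centimeters) as an illustrative ratio scale; these examples are chosen here for illustration and are not part of the definitions above. On the interval scale the "ratio" of two readings depends on the unit, because the zero point is arbitrary; on the ratio scale it does not.

```python
def celsius_to_fahrenheit(c):
    """Convert a Celsius temperature to Fahrenheit."""
    return c * 9 / 5 + 32


# Interval scale: two temperatures, expressed in two different units.
low_c, high_c = 10.0, 20.0
low_f, high_f = celsius_to_fahrenheit(low_c), celsius_to_fahrenheit(high_c)

ratio_c = high_c / low_c   # looks like "twice as hot" in Celsius
ratio_f = high_f / low_f   # but not in Fahrenheit, so the ratio has no meaning

# Ratio scale: two lengths, expressed in two different units.
short_m, long_m = 1.5, 3.0
ratio_m = long_m / short_m
ratio_cm = (long_m * 100) / (short_m * 100)  # same ratio regardless of unit

print(f"temperature ratio: {ratio_c:.2f} (Celsius) vs {ratio_f:.2f} (Fahrenheit)")
print(f"length ratio:      {ratio_m:.2f} (m) vs {ratio_cm:.2f} (cm)")
```

Differences remain meaningful on both scales: a 10-degree Celsius gap is the same size anywhere on the scale, even though "20 degrees is twice 10 degrees" is not a valid statement.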

## 1.4 Experimental Design and Ethics

**Explanatory variable**: the variable whose effect you want to study; the independent variable.

**Response variable**: the variable that you suspect is affected by the other variable; the dependent variable.

**Experimental unit**: a single object or individual to be measured.

**Placebo**: a treatment that cannot influence the response variable.

**Double-blinded experiment**: an experiment in which both the subjects and the researchers involved with the subjects are blinded.

**Nonsampling error**: an issue, other than natural variation, that affects the reliability of sampling data.

**Institutional Review Board**: a committee tasked with oversight of research programs that involve human subjects.
