# Chapter 1: Sampling and Data

## 1.1 Definitions of Statistics, Probability, and Key Terms

• Statistics: the science of planning studies and experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions based on the data.

• Data: collections of observations.

• Descriptive Statistics: organizing and summarizing data by graphing and by computing numerical values (such as an average).

• Inferential Statistics: uses methods that take a result from a sample, extend it to the population, and measure the reliability of the result.

• Probability: the chance of an event occurring.

• Population: the complete collection of all individuals to be studied.

• Sample: a subcollection of members selected from a population.

• Sampling: selecting a portion (or subset) of the larger population and studying that portion (the sample) to gain information about the population. Data are the result of sampling from a population.

• Parameter: a numerical measurement describing some characteristic of a population.

• Statistic: a numerical measurement describing some characteristic of a sample.

• Representative Sample: a sample that reflects the characteristics of the population. One of the main concerns in the field of statistics is how accurately a statistic estimates a parameter.

• Variable: a characteristic or measurement that can be determined for each member of a population.

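• Mean: the sum of the values divided by the number of values; also called the “average.”

• Proportion: the part out of the whole; the number of members with a given characteristic divided by the total number of members.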
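
The distinction between a parameter and a statistic can be illustrated with a minimal Python sketch. The population of ages below is made up for illustration, and the names `population`, `sample`, and the age cutoff of 65 are assumptions, not part of the definitions above.

```python
import random
from statistics import mean

# Hypothetical population: ages of 1,000 people (made-up values).
rng = random.Random(42)
population = [rng.randint(18, 90) for _ in range(1000)]

# Parameter: a numerical measurement describing the whole population.
population_mean = mean(population)
population_proportion = sum(age >= 65 for age in population) / len(population)

# Sample: a subcollection of members selected from the population.
sample = rng.sample(population, 50)

# Statistic: the same measurement, computed from the sample only.
sample_mean = mean(sample)
sample_proportion = sum(age >= 65 for age in sample) / len(sample)

print(f"parameter (population mean):       {population_mean:.2f}")
print(f"statistic (sample mean):           {sample_mean:.2f}")
print(f"parameter (population proportion): {population_proportion:.3f}")
print(f"statistic (sample proportion):     {sample_proportion:.3f}")
```

The sample values will usually be close to, but not equal to, the population values; how close they tend to be is the question inferential statistics addresses.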

## 1.2 Data, Sampling, and Variation in Data and Sampling

• Quantitative (or numerical) data: data that consists of numbers representing counts or measurements.

• Qualitative (or Categorical) data: data that consists of names or labels that are not numbers representing counts or measurements.

• Discrete data: quantitative data which results when the number of possible values is either a finite number or a countable number.

• Continuous data: quantitative data which results when there are infinitely many possible values corresponding to some continuous scale that covers a range of values without gaps, interruptions, or jumps.

• Pie Chart: categories of data are represented by wedges in a circle; each wedge is proportional in size to the percentage of individuals in its category.

• Bar Graph: the length of the bar for each category is proportional to the number or percent of individuals in each category.

• Pareto chart: a bar graph whose bars are sorted by category size, from largest to smallest.
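
As a rough illustration of how a Pareto chart orders qualitative data, the sketch below counts categories and prints a text-only bar for each, largest first. The complaint categories are hypothetical, and a real chart would normally be drawn with a plotting library rather than printed.

```python
from collections import Counter

# Hypothetical qualitative data: complaint categories (made-up values).
complaints = [
    "billing", "delay", "billing", "damage", "delay",
    "billing", "other", "delay", "billing", "damage",
]

counts = Counter(complaints)
total = sum(counts.values())

# Pareto ordering: sort the categories by size, largest to smallest.
pareto_order = sorted(counts.items(), key=lambda item: item[1], reverse=True)

for category, count in pareto_order:
    percent = 100 * count / total
    bar = "#" * count  # bar length is proportional to the count
    print(f"{category:<8} {bar:<5} {count:>2} ({percent:.0f}%)")
```

The same percentages would set the wedge sizes in a pie chart; only the sorting step is specific to the Pareto chart.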

### Sampling Methods

• Simple random sample: A sample of n subjects selected in such a way that every possible sample of the same size n has the same chance of being chosen.

• Systematic sample: A sample in which the researcher selects some starting point and then selects every kth element in the population.

• Stratified sample: A sample in which the researcher subdivides the population into at least two different subgroups (or strata), and then draws a sample from each subgroup.

• Cluster sample: A sample in which the researcher first divides the population into sections (or clusters), then randomly selects some of those clusters and includes all members of the selected clusters.

• Convenience sample: A sample in which the researcher simply uses results that are very easy to get. This is not a valid sampling method and will likely result in biased data.

• Bias: occurs when the results of the sample are not representative of the population.
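
The four random sampling methods above can be sketched in Python using only the standard `random` module. This is a minimal illustration under assumed conditions: the population of 100 numbered members, the four equal-sized groups, and all function names are hypothetical.

```python
import random

rng = random.Random(0)

# Hypothetical population: 100 member IDs, divided into 4 groups of 25.
population = list(range(1, 101))
groups = {
    "A": population[0:25],
    "B": population[25:50],
    "C": population[50:75],
    "D": population[75:100],
}

def simple_random_sample(members, n):
    # Every possible sample of size n has the same chance of being chosen.
    return rng.sample(members, n)

def systematic_sample(members, k):
    # Pick a random starting point, then take every kth element.
    start = rng.randrange(k)
    return members[start::k]

def stratified_sample(strata, n_per_stratum):
    # Draw a random sample from every subgroup (stratum).
    chosen = []
    for members in strata.values():
        chosen.extend(rng.sample(members, n_per_stratum))
    return chosen

def cluster_sample(clusters, n_clusters):
    # Randomly pick whole clusters, then keep all members of those clusters.
    picked = rng.sample(sorted(clusters), n_clusters)
    chosen = []
    for name in picked:
        chosen.extend(clusters[name])
    return chosen

srs = simple_random_sample(population, 10)
systematic = systematic_sample(population, 10)
stratified = stratified_sample(groups, 3)
cluster = cluster_sample(groups, 2)

print("simple random:", sorted(srs))
print("systematic:   ", systematic)
print("stratified:   ", sorted(stratified))
print("cluster size: ", len(cluster))
```

Note the difference between the last two: the stratified sample takes a few members from every group, while the cluster sample takes every member from a few groups.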

### Sources of Bias in Sampling

• Sampling bias: occurs when the technique used to obtain the individuals in the sample tends to favor one part of the population over another.

• Nonresponse bias: occurs when individuals selected for the sample who do not respond to a survey have different opinions from those who do.

• Response bias: occurs when answers on a survey do not reflect the true feelings of the respondent.

• Interview error: error introduced by the way an interview is conducted. A trained interviewer is essential to obtain accurate information; they have the skill necessary to elicit responses and make the interviewee feel comfortable.

• Misrepresented Answers: some survey questions result in responses that misrepresent facts or are flat-out lies.

• Loaded Questions: The wording and presentation of questions play a large role in the type of response given. The way a question is worded can lead to response bias, so questions should always be asked in a balanced form.

• Ordering of Questions/Words: Questions can be unintentionally loaded by the order of items being considered. Many surveys rearrange the order of the questions within a questionnaire so that responses are not affected by prior questions.

• Data-entry error: although not technically a form of response bias, data-entry errors can lead to results that are not representative of the population.

### Common Problems

• Problems with samples: A sample must be representative of the population. A sample that is not representative of the population is biased.

• Self-selected samples: Responses only by people who choose to respond, such as call-in surveys, are often unreliable.

• Sample size issues: Samples that are too small may be unreliable. Larger samples are better, if possible. In some situations, small samples are unavoidable and can still be used to draw conclusions.

• Undue influence: collecting data or asking questions in a way that influences the response.

• Non-response or refusal of participation: The collected responses may no longer be representative of the population. Often, people with strong positive or negative opinions may answer surveys, which can affect the results.

• Causality: A relationship between two variables does not mean that one causes the other to occur. They may be related (correlated) because of their relationship through a different variable.

• Misleading use of data: improperly displayed graphs, incomplete data, or lack of context.

• Confounding: occurs when the effects of multiple factors on a response cannot be separated.

## 1.3 Frequency, Frequency Tables, and Levels of Measurement

• Frequency: The number of times a value of the data occurs.

• Relative frequency: The ratio (fraction or proportion) of the number of times a value of the data occurs in the set of all outcomes to the total number of outcomes.

• Cumulative relative frequency: The sum of the relative frequencies for the current value and all previous values.
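
A frequency table combining these three quantities can be built with a short Python sketch. The data values below are hypothetical, and the variable names are illustrative only.

```python
from collections import Counter

# Hypothetical data: hours of sleep reported by 10 students (made-up values).
data = [2, 3, 3, 4, 4, 4, 5, 5, 6, 7]

counts = Counter(data)
n = len(data)

table = []
cumulative_count = 0
for value in sorted(counts):
    frequency = counts[value]              # times the value occurs
    relative = frequency / n               # frequency / total outcomes
    cumulative_count += frequency
    cumulative = cumulative_count / n      # running total of relative frequencies
    table.append((value, frequency, relative, cumulative))

print(f"{'value':>5} {'freq':>5} {'rel':>6} {'cum rel':>8}")
for value, frequency, relative, cumulative in table:
    print(f"{value:>5} {frequency:>5} {relative:>6.2f} {cumulative:>8.2f}")
```

The relative frequencies always sum to 1, so the last entry in the cumulative relative frequency column is always 1.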

### Levels of Measurement

• Nominal scale level: data that cannot be ordered or used in calculations.

• Ordinal scale level: data that can be ordered, but the differences between values cannot be measured.

• Interval scale level: data with a definite ordering but no natural starting point (zero); the differences can be measured, but ratios are not meaningful.

• Ratio scale level: data that can be ordered and that has a natural starting point (zero); the differences have meaning and ratios can be calculated.
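
The difference between the interval and ratio levels can be seen with temperature, used here as an assumed example that the definitions above do not mention. Celsius and Fahrenheit have arbitrary zero points (interval level), while Kelvin starts at a true zero (ratio level). The readings and function names in this sketch are hypothetical.

```python
# Hypothetical temperature readings in degrees Celsius (interval level).
morning_c = 10.0
afternoon_c = 20.0

def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32

def celsius_to_kelvin(c):
    return c + 273.15

# Differences are meaningful on an interval scale: 10 degrees Celsius warmer.
difference_c = afternoon_c - morning_c

# Ratios are not: the "ratio" changes when the arbitrary zero point changes.
naive_ratio = afternoon_c / morning_c
fahrenheit_ratio = celsius_to_fahrenheit(afternoon_c) / celsius_to_fahrenheit(morning_c)

# Kelvin has a true zero, so this ratio is physically meaningful.
kelvin_ratio = celsius_to_kelvin(afternoon_c) / celsius_to_kelvin(morning_c)

print(f"difference in Celsius: {difference_c}")
print(f"Celsius 'ratio':       {naive_ratio:.3f}")
print(f"Fahrenheit 'ratio':    {fahrenheit_ratio:.3f}")
print(f"Kelvin ratio:          {kelvin_ratio:.3f}")
```

Because the Celsius and Fahrenheit ratios disagree for the same two temperatures, neither supports a claim such as "twice as hot."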

## 1.4 Experimental Design and Ethics

• Explanatory variable: The variable whose effect you want to study; the independent variable.

• Response variable: the variable that you suspect is affected by the other variable; the dependent variable.

• Experimental Unit: a single object or individual to be measured.

• Placebo: A treatment that cannot influence the response variable.

• Double-blinded experiment: one in which neither the subjects nor the researchers working with them know which treatment each subject is receiving.

• Nonsampling Error: an issue, other than natural variation, that affects the reliability of sampling data.

• Institutional Review Board: a committee tasked with oversight of research programs that involve human subjects.