Chapter 1: Sampling and Data
1.1 Definitions of Statistics, Probability, and Key Terms
- Statistics: the science of planning studies and experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions based on the data.
- Data: collections of observations.
- Descriptive Statistics: organizing and summarizing data by graphing and by numerical values (such as an average).
- Inferential Statistics: uses methods that take a result from a sample, extend it to the population, and measure the reliability of the result.
- Probability: the chance of an event occurring.
- Population: the complete collection of all individuals to be studied.
- Sample: a subcollection of members selected from a population.
- Sampling: selecting a portion (or subset) of the larger population and studying that portion (the sample) to gain information about the population. Data are the result of sampling from a population.
- Parameter: a numerical measurement describing some characteristic of a population.
- Statistic: a numerical measurement describing some characteristic of a sample.
- Representative Sample: a sample that contains the characteristics of the population it was drawn from. One of the main concerns in the field of statistics is how accurately a statistic estimates a parameter.
- Variable: a characteristic or measurement that can be determined for each member of a population.
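- Mean: the sum of the values divided by the number of values; commonly called the “average.”
- Proportion: the part out of the whole (total), expressed as a fraction, decimal, or percent.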
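The distinction between a parameter and a statistic, along with the mean and a proportion, can be sketched in a few lines of Python. The scores below are hypothetical values invented for illustration, and the cutoff of 80 is an arbitrary assumption.

```python
from statistics import mean

# Hypothetical population: exam scores for all 10 students in a small class.
population = [72, 85, 90, 66, 78, 95, 88, 70, 81, 75]

# Hypothetical sample: the scores of 4 students drawn from that class.
sample = [85, 66, 95, 81]

# Parameter: a numerical summary of the whole population.
population_mean = mean(population)

# Statistic: the same summary computed from the sample only.
sample_mean = mean(sample)

# Proportion: part out of the whole, here the share of sample scores of at least 80.
sample_proportion = sum(1 for score in sample if score >= 80) / len(sample)

print("population mean (parameter):", population_mean)
print("sample mean (statistic):", sample_mean)
print("sample proportion scoring 80 or more:", sample_proportion)
```

In this made-up case the sample mean (81.75) is close to, but not equal to, the population mean (80), which is the kind of gap inferential statistics tries to quantify.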
1.2 Data, Sampling, and Variation in Data and Sampling
- Quantitative (or numerical) data: data that consists of numbers representing counts or measurements.
- Qualitative (or Categorical) data: data that consists of names or labels that are not numbers representing counts or measurements.
- Discrete data: quantitative data which results when the number of possible values is either a finite number or a countable number.
- Continuous data: quantitative data which results when there are infinitely many possible values corresponding to some continuous scale that covers a range of values without gaps, interruptions, or jumps.
- Pie Chart: categories of data are represented by wedges in a circle, with each wedge proportional in size to the percentage of individuals in its category.
- Bar Graph: the length of the bar for each category is proportional to the number or percent of individuals in each category.
- Pareto chart: consists of bars that are sorted into order by category size (largest to smallest).
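As a rough sketch of how a Pareto chart orders its bars, the following Python counts categories and sorts them from largest to smallest. The commute responses are hypothetical, and the text bars stand in for a real plot.

```python
from collections import Counter

# Hypothetical qualitative data: reported commute method for 10 people.
responses = ["car", "bus", "car", "bike", "car", "bus", "walk", "car", "bus", "bike"]

counts = Counter(responses)

# A Pareto chart draws the bars from the largest category to the smallest.
pareto_order = sorted(counts.items(), key=lambda item: item[1], reverse=True)

total = len(responses)
for category, count in pareto_order:
    percent = 100 * count / total
    # A crude text bar whose length is proportional to the count.
    print(f"{category:<5} {'#' * count} {count} ({percent:.0f}%)")
```

The same counts, left unsorted, would give an ordinary bar graph; the percentages are what a pie chart would use to size its wedges.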
Sampling Methods
- Simple random sample: A sample of n subjects selected in such a way that every possible sample of the same size n has the same chance of being chosen.
- Systematic sample: A sample in which the researcher selects some starting point and then selects every kth element in the population.
- Stratified sample: A sample in which the researcher subdivides the population into at least two different subgroups (or strata), and then draws a sample from each subgroup.
- Cluster sample: A sample in which the researcher first divides the population into sections (or clusters), then randomly selects some of those clusters, and then chooses all members from the selected clusters.
- Convenience sample: A sample in which the researcher simply uses results that are very easy to get. This is not a valid sampling method and will likely result in biased data.
- Bias: occurs when the results of the sample are not representative of the population.
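The four random sampling methods above can be sketched with Python's standard `random` module. This is a minimal illustration under stated assumptions: the function names, the member IDs 1 to 20, and the groups A to D are all hypothetical, and the same `groups` dictionary is reused as both strata and clusters purely to keep the example short.

```python
import random

def simple_random_sample(population, n, rng):
    # Every possible subset of size n is equally likely to be chosen.
    return rng.sample(population, n)

def systematic_sample(population, k, rng):
    # Pick a random starting point among the first k elements, then take every kth element.
    start = rng.randrange(k)
    return population[start::k]

def stratified_sample(strata, n_per_stratum, rng):
    # Draw a simple random sample from every stratum.
    chosen = []
    for members in strata.values():
        chosen.extend(rng.sample(members, n_per_stratum))
    return chosen

def cluster_sample(clusters, n_clusters, rng):
    # Randomly choose whole clusters, then keep every member of each chosen cluster.
    chosen_keys = rng.sample(sorted(clusters), n_clusters)
    return [member for key in chosen_keys for member in clusters[key]]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = list(range(1, 21))  # hypothetical member IDs 1..20
groups = {
    "A": [1, 2, 3, 4, 5],
    "B": [6, 7, 8, 9, 10],
    "C": [11, 12, 13, 14, 15],
    "D": [16, 17, 18, 19, 20],
}

srs = simple_random_sample(population, 5, rng)
systematic = systematic_sample(population, 4, rng)
stratified = stratified_sample(groups, 2, rng)
clustered = cluster_sample(groups, 2, rng)

print("simple random:", srs)
print("systematic:   ", systematic)
print("stratified:   ", stratified)
print("cluster:      ", clustered)
```

Note the contrast between the last two: the stratified sample takes a few members from every group, while the cluster sample takes every member from a few groups.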
Sources of Bias in Sampling
- Sampling bias: the technique used to obtain the individuals to be in the sample tends to favor one part of the population over another.
- Nonresponse bias: when individuals selected to be in the sample who do not respond to a survey have different opinions from those who do.
- Response bias: when answers on a survey do not reflect the true feelings of the respondent.
- Interview error: error introduced by the way an interview is conducted. A trained interviewer is essential to obtain accurate information, because they have the skill necessary to elicit responses and make the interviewee feel comfortable.
- Misrepresented Answers: some survey questions result in responses that misrepresent facts or are flat-out lies.
- Loaded Questions: The wording and presentation of questions play a large role in the type of response given to the question. The way a question is worded can lead to response bias, so questions must always be asked in a balanced form.
- Ordering of Questions/Words: Questions can be unintentionally loaded by the order of items being considered. Many surveys rearrange the order of the questions within a questionnaire so that responses are not affected by prior questions.
- Data-entry error: although not technically a result of response bias, data-entry errors will lead to results that are not representative of the population.
Common Problems
- Problems with samples: A sample must be representative of the population. A sample that is not representative of the population is biased.
- Self-selected samples: Responses only by people who choose to respond, such as call-in surveys, are often unreliable.
- Sample size issues: Samples that are too small may be unreliable. Larger samples are better, if possible. In some situations, small samples are unavoidable and can still be used to draw conclusions.
- Undue influence: collecting data or asking questions in a way that influences the response.
- Non-response or refusal of participation: The collected responses may no longer be representative of the population. Often, people with strong positive or negative opinions are the ones who answer surveys, which can affect the results.
- Causality: A relationship between two variables does not mean that one causes the other to occur. They may be related (correlated) because of their relationship through a different variable.
- Misleading use of data: improperly displayed graphs, incomplete data, or lack of context.
- Confounding: When the effects of multiple factors on a response cannot be separated.
1.3 Frequency, Frequency Tables, and Levels of Measurement
- Frequency: The number of times a value of the data occurs.
- Relative frequency: The ratio (fraction or proportion) of the number of times a value of the data occurs in the set of all outcomes to the total number of outcomes.
- Cumulative relative frequency: The accumulation of the previous relative frequencies; for a given value, it is the sum of that value's relative frequency and the relative frequencies of all values before it.
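A frequency table built from these three definitions might look like the following Python sketch. The sibling counts are hypothetical data invented for illustration.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical data: number of siblings reported by 20 students.
data = [0, 1, 1, 2, 2, 2, 2, 3, 1, 0, 2, 1, 1, 3, 4, 2, 1, 0, 2, 1]

counts = Counter(data)
values = sorted(counts)
n = len(data)

# Frequency: how many times each value occurs.
frequencies = [counts[v] for v in values]

# Relative frequency: frequency divided by the total number of outcomes.
relative = [f / n for f in frequencies]

# Cumulative relative frequency: running total of the frequencies, divided by n.
cumulative = [c / n for c in accumulate(frequencies)]

print(f"{'value':>5} {'freq':>5} {'rel':>6} {'cum':>6}")
for v, f, r, c in zip(values, frequencies, relative, cumulative):
    print(f"{v:>5} {f:>5} {r:>6.2f} {c:>6.2f}")
```

The running total is taken over the integer frequencies before dividing, so the last cumulative relative frequency comes out as exactly 1, which is a useful check on any frequency table.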
Levels of Measurement
- Nominal scale level: data that cannot be ordered and cannot be used in calculations.
- Ordinal scale level: data that can be ordered, but the differences between values cannot be measured.
- Interval scale level: data with a definite ordering but no true starting point (zero); the differences can be measured, but ratios are not meaningful.
- Ratio scale level: data that can be ordered and has a true starting point (zero); the differences have meaning and ratios can be calculated.
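One way to see why ratios are not meaningful at the interval level is to change units and check whether the ratio survives. The sketch below uses temperature (interval) and length (ratio) as assumed examples; neither the values nor the choice of examples comes from these notes.

```python
def celsius_to_fahrenheit(c):
    # Standard conversion; note that 0 degrees C maps to 32 degrees F, not to 0.
    return c * 9 / 5 + 32

# Interval scale: temperature in degrees, where zero is an arbitrary point.
warm_c, cool_c = 20.0, 10.0
warm_f, cool_f = celsius_to_fahrenheit(warm_c), celsius_to_fahrenheit(cool_c)

ratio_c = warm_c / cool_c  # looks like "twice as hot" in Celsius
ratio_f = warm_f / cool_f  # the same two temperatures give a different ratio in Fahrenheit

# Ratio scale: length, where zero means "no length" in every unit.
long_m, short_m = 4.0, 2.0
long_cm, short_cm = long_m * 100, short_m * 100

ratio_m = long_m / short_m
ratio_cm = long_cm / short_cm

print("temperature ratio in C:", ratio_c, "in F:", ratio_f)
print("length ratio in m:", ratio_m, "in cm:", ratio_cm)
```

The temperature ratio depends on the unit chosen, so it carries no real meaning, while the length ratio is the same in metres and centimetres.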
1.4 Experimental Design and Ethics
- Explanatory variable: The variable whose effect you want to study; the independent variable.
- Response variable: the variable that you suspect is affected by the other variable; the dependent variable.
- Experimental Unit: a single object or individual to be measured.
- Placebo: A treatment that cannot influence the response variable.
- Double-blinded experiment: one in which both the subjects and the researchers involved with the subjects are blinded.
- Nonsampling Error: an issue, other than natural variation, that affects the reliability of sampling data.
- Institutional Review Board: a committee tasked with oversight of research programs that involve human subjects.