Chapter 1: Sampling and Data

1.1 Definitions of Statistics, Probability, and Key Terms

  • Statistics: the science of planning studies and experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions based on the data.

  • Data: collections of observations.

  • Descriptive Statistics: organizing and summarizing data by graphing and by numerical values (such as an average).

  • Inferential Statistics: uses methods that take a result from a sample, extend it to the population, and measure the reliability of the result.

  • Probability: the chance of an event occurring.

  • Population: the complete collection of all individuals to be studied.

  • Sample: a subcollection of members selected from a population.

  • Sampling: selecting a portion (or subset) of the larger population and studying that portion (the sample) to gain information about the population. Data are the result of sampling from a population.

  • Parameter: a numerical measurement describing some characteristic of a population.

  • Statistic: a numerical measurement describing some characteristic of a sample.

  • Representative Sample: a sample that has the same characteristics as the population it is drawn from. One of the main concerns in the field of statistics is how accurately a statistic estimates a parameter.

  • Variable: a characteristic or measurement that can be determined for each member of a population.

  • Mean: the arithmetic “average,” found by adding all the values and dividing by the number of values.

  • Proportion: the part out of the whole (total), often written as a fraction, decimal, or percentage.
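
The difference between a parameter and a statistic can be illustrated with a minimal Python sketch. The population of ages below is made-up data, and the names (population, sample, sample_proportion) are assumptions chosen for illustration only.

```python
import random
from statistics import mean

# Hypothetical population: ages of 1,000 individuals (made-up data).
rng = random.Random(0)
population = [rng.randint(18, 90) for _ in range(1000)]

# Parameter: a numerical measurement describing the whole population.
population_mean = mean(population)

# Sample: a subcollection of members selected from the population.
sample = rng.sample(population, 50)

# Statistic: a numerical measurement describing the sample.
sample_mean = mean(sample)

# Proportion: part out of the total, here the share of sampled ages 65 or older.
sample_proportion = sum(1 for age in sample if age >= 65) / len(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
print(f"sample proportion aged 65+:  {sample_proportion:.2f}")
```

The sample mean will usually be close to, but not equal to, the population mean; how close it tends to be is the question inferential statistics addresses.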

1.2 Data, Sampling, and Variation in Data and Sampling

  • Quantitative (or numerical) data: data that consists of numbers representing counts or measurements.

  • Qualitative (or Categorical) data: data that consists of names or labels that are not numbers representing counts or measurements.

  • Discrete data: quantitative data which results when the number of possible values is either a finite number or a countable number.

  • Continuous data: quantitative data which results when there are infinitely many possible values corresponding to some continuous scale that covers a range of values without gaps, interruptions, or jumps.

  • Pie Chart: categories of data are represented by wedges in a circle, with each wedge proportional in size to the percentage of individuals in its category.

  • Bar Graph: the length of the bar for each category is proportional to the number or percent of individuals in each category.

  • Pareto chart: consists of bars that are sorted into order by category size (largest to smallest).
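
The ordering used by a Pareto chart can be sketched in Python as follows. The survey responses are made-up data, and the text-based bars are only a stand-in for a real plot.

```python
from collections import Counter

# Hypothetical qualitative data: how ten people commute (made-up responses).
responses = ["bus", "car", "car", "bike", "car", "bus", "walk", "car", "bus", "bike"]

counts = Counter(responses)
total = sum(counts.values())

# Pareto ordering: categories sorted by size, largest to smallest.
pareto_order = counts.most_common()

for category, count in pareto_order:
    percent = 100 * count / total
    # Bar length is proportional to the number of individuals in the category.
    print(f"{category:<5} {'#' * count} {count} ({percent:.0f}%)")
```

The same counts and percentages could feed a bar graph or pie chart; only the Pareto chart requires the sorted order.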

Sampling Methods

  • Simple random sample: A sample of n subjects selected in such a way that every possible sample of the same size n has the same chance of being chosen.

  • Systematic sample: A sample in which the researcher selects some starting point and then selects every kth element in the population.

  • Stratified sample: A sample in which the researcher subdivides the population into at least two different subgroups (or strata), and then draws a sample from each subgroup.

  • Cluster sample: A sample in which the researcher first divides the population into sections (or clusters), then randomly selects some of those clusters, and then chooses all members from the selected clusters.

  • Convenience sample: A sample in which the researcher simply uses results that are very easy to get. This is not a valid sampling method and will likely result in biased data.

  • Bias: occurs when the results of the sample are not representative of the population.
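
The four random sampling methods listed above can be sketched in Python. This is a minimal illustration under stated assumptions: the population is a hypothetical list of member IDs 1 to 100, the two strata and five clusters are arbitrary choices, and the sample sizes are picked only to keep the example small.

```python
import random

rng = random.Random(42)
population = list(range(1, 101))  # hypothetical member IDs 1..100

# Simple random sample: every possible sample of size n is equally likely.
n = 10
simple = rng.sample(population, n)

# Systematic sample: pick a random starting point, then take every kth element.
k = len(population) // n
start = rng.randrange(k)
systematic = population[start::k]

# Stratified sample: subdivide the population into strata, then draw from each.
strata = {
    "low": [x for x in population if x <= 50],
    "high": [x for x in population if x > 50],
}
stratified = [member for group in strata.values() for member in rng.sample(group, 5)]

# Cluster sample: divide into clusters, randomly select some clusters,
# then keep all members of the selected clusters.
clusters = [population[i:i + 20] for i in range(0, len(population), 20)]
chosen_clusters = rng.sample(clusters, 2)
cluster_sample = [member for cluster in chosen_clusters for member in cluster]

print("simple:    ", sorted(simple))
print("systematic:", systematic)
print("stratified:", sorted(stratified))
print("cluster:   ", cluster_sample)
```

A convenience sample has no counterpart here because it involves no random selection at all.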

Sources of Bias in Sampling

  • Sampling bias: the technique used to obtain the individuals to be in the sample tends to favor one part of the population over another.

  • Nonresponse bias: when individuals selected to be in the sample who do not respond to a survey have different opinions from those who do.

  • Response bias: when answers on a survey do not reflect the true feelings of the respondent.

    • Interview error: a trained interviewer is essential to obtain accurate information. A trained interviewer has the skill necessary to elicit responses and make the interviewee feel comfortable.

    • Misrepresented Answers: some survey questions result in responses that misrepresent facts or are flat-out lies.

    • Loaded Questions: The wording and presentation of questions play a large role in the type of response given to the question. The way a question is worded can lead to response bias, so questions must always be asked in a balanced form.

    • Ordering of Questions/Words: Questions can be unintentionally loaded by the order of items being considered. Many surveys rearrange the order of the questions within a questionnaire so that responses are not affected by prior questions.

    • Data-entry error: not technically a form of response bias, but data-entry errors can lead to results that are not representative of the population.

Common Problems

  • Problems with samples: A sample must be representative of the population. A sample that is not representative of the population is biased.

  • Self-selected samples: Responses only by people who choose to respond, such as call-in surveys, are often unreliable.

  • Sample size issues: Samples that are too small may be unreliable. Larger samples are better, if possible. In some situations, small samples are unavoidable and can still be used to draw conclusions.

  • Undue influence: collecting data or asking questions in a way that influences the response.

  • Non-response or refusal of participation: The collected responses may no longer be representative of the population. Often, people with strong positive or negative opinions are the ones most likely to answer surveys, which can affect the results.

  • Causality: A relationship between two variables does not mean that one causes the other to occur. They may be related (correlated) because of their relationship through a different variable.

  • Misleading use of data: improperly displayed graphs, incomplete data, or lack of context.

  • Confounding: When the effects of multiple factors on a response cannot be separated.

1.3 Frequency, Frequency Tables, and Levels of Measurement

  • Frequency: The number of times a value of the data occurs.

  • Relative frequency: The ratio (fraction or proportion) of the number of times a value of the data occurs in the set of all outcomes to the total number of outcomes.

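  • Cumulative relative frequency: The accumulation of the previous relative frequencies; that is, the sum of the relative frequency for the current value and the relative frequencies of all values before it.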
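
A frequency table with all three columns can be built with a short Python sketch. The data values are made up for illustration, and the variable names are assumptions rather than standard terminology.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical data: ten observed values (made-up for illustration).
data = [2, 3, 3, 4, 4, 4, 5, 5, 6, 6]

freq = Counter(data)
values = sorted(freq)
n = len(data)

# Frequency: the number of times each value occurs.
frequencies = [freq[v] for v in values]

# Relative frequency: frequency divided by the total number of outcomes.
relative = [f / n for f in frequencies]

# Cumulative relative frequency: running total of the relative frequencies.
cumulative = list(accumulate(relative))

print(f"{'value':>5} {'freq':>5} {'rel':>6} {'cum rel':>8}")
for v, f, r, c in zip(values, frequencies, relative, cumulative):
    print(f"{v:>5} {f:>5} {r:>6.2f} {c:>8.2f}")
```

The last entry in the cumulative column should equal 1 (up to floating-point rounding), which is a quick check that no observations were dropped.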

Levels of Measurement

  • Nominal scale level: data that cannot be ordered and cannot be used in calculations.

  • Ordinal scale level: data that can be ordered, but the differences between values cannot be measured.

  • Interval scale level: data with a definite ordering but no natural starting point (true zero); the differences can be measured, but ratios are not meaningful.

  • Ratio scale level: data with a natural starting point (true zero) that can be ordered; the differences have meaning and ratios can be calculated.

1.4 Experimental Design and Ethics

  • Explanatory variable: The variable whose effect you want to study; the independent variable.

  • Response variable: the variable that you suspect is affected by the other variable; the dependent variable.

  • Experimental Unit: a single object or individual to be measured.

  • Placebo: A treatment that cannot influence the response variable.

  • Double-blinded experiment: one in which both the subjects and the researchers involved with the subjects are blinded, meaning neither group knows which treatment each subject is receiving.

  • Nonsampling Error: an issue that affects the reliability of sampling data other than natural variation.

  • Institutional Review Board: a committee tasked with oversight of research programs that involve human subjects.
