Chapter 1: Sampling and Data

1.1 Definitions of Statistics, Probability, and Key Terms

  • Statistics: the science of planning studies and experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions based on the data.
  • Data: collections of observations.
  • Descriptive Statistics: organizing and summarizing data by graphing and by numerical values (such as an average).
  • Inferential Statistics: uses methods that take a result from a sample, extend it to the population, and measure the reliability of the result.
  • Probability: the chance of an event occurring.
  • Population: the complete collection of all individuals to be studied.
  • Sample: a subcollection of members selected from a population.
  • Sampling: selecting a portion (or subset) of the larger population and studying that portion (the sample) to gain information about the population. Data are the result of sampling from a population.
  • Parameter: a numerical measurement describing some characteristic of a population.
  • Statistic: a numerical measurement describing some characteristic of a sample.
  • Representative Sample: a sample that reflects the characteristics of the population. One of the main concerns in the field of statistics is how accurately a statistic estimates a parameter.
  • Variable: a characteristic or measurement that can be determined for each member of a population.
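  • Mean: the arithmetic average of a set of values, found by dividing their sum by the number of values.
  • Proportion: the part out of the whole (or total) that has a given characteristic.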
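
The relationship between a parameter and a statistic can be sketched in a few lines of Python. This is only an illustration: the population of ages is made up, and the names (population, sample, the 65-and-over cutoff) are assumptions chosen for the example, not part of the definitions above.

```python
import random
from statistics import mean

# Hypothetical population: ages of 1,000 individuals (made-up data).
rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = [rng.randint(18, 90) for _ in range(1000)]

# Parameters: numerical measurements describing the whole population.
population_mean = mean(population)
population_prop = sum(age >= 65 for age in population) / len(population)

# Sample: a subcollection of members selected from the population.
sample = rng.sample(population, 50)

# Statistics: the same measurements, computed from the sample only.
sample_mean = mean(sample)
sample_prop = sum(age >= 65 for age in sample) / len(sample)

print(f"mean age:        parameter {population_mean:.2f}, statistic {sample_mean:.2f}")
print(f"proportion 65+:  parameter {population_prop:.3f}, statistic {sample_prop:.3f}")
```

The statistic will usually differ somewhat from the parameter; how closely it tracks the parameter is the accuracy question raised under Representative Sample.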

1.2 Data, Sampling, and Variation in Data and Sampling

  • Quantitative (or numerical) data: data that consists of numbers representing counts or measurements.
  • Qualitative (or Categorical) data: data that consists of names or labels that are not numbers representing counts or measurements.
  • Discrete data: quantitative data which results when the number of possible values is either a finite number or a countable number.
  • Continuous data: quantitative data which results when there are infinitely many possible values corresponding to some continuous scale that covers a range of values without gaps, interruptions, or jumps.
  • Pie Chart: categories of data are represented by wedges in a circle, with each wedge proportional in size to the percentage of individuals in its category.
  • Bar Graph: the length of the bar for each category is proportional to the number or percent of individuals in each category.
  • Pareto chart: a bar graph whose bars are sorted by category size (largest to smallest).
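
As a rough sketch of the Pareto ordering, the snippet below counts a small set of qualitative responses and prints text bars from largest to smallest category. The commuting responses are hypothetical, and the text bars stand in for a real plotting library.

```python
from collections import Counter

# Hypothetical qualitative data: how ten people commute.
responses = ["bus", "car", "car", "bike", "car", "bus", "walk", "car", "bus", "car"]

counts = Counter(responses)
total = sum(counts.values())

# Pareto order: categories sorted by size, largest to smallest.
pareto = sorted(counts.items(), key=lambda item: item[1], reverse=True)

for category, count in pareto:
    percent = 100 * count / total
    # Bar length is proportional to the number of individuals in the category.
    print(f"{category:<5} {'#' * count:<6} {count} ({percent:.0f}%)")
```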

Sampling Methods

  • Simple random sample: A sample of n subjects selected in such a way that every possible sample of the same size n has the same chance of being chosen.
  • Systematic sample: A sample in which the researcher selects some starting point and then selects every kth element in the population.
  • Stratified sample: A sample in which the researcher subdivides the population into at least two different subgroups (or strata), and then draws a sample from each subgroup.
  • Cluster sample: A sample in which the researcher first divides the population into sections (or clusters), then randomly selects some of those clusters, and then chooses all the members from the selected clusters.
  • Convenience sample: A sample in which the researcher simply uses results that are very easy to get. This is not a valid sampling method and will likely result in biased data.
  • Bias: a systematic tendency for the results of a sample to not be representative of the population.
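
The four probability-based methods above can be sketched as follows. This is a minimal illustration under stated assumptions: the population of 100 member IDs and the group_of mapping are hypothetical, the systematic sample picks its starting point at random, and the stratified sample draws a simple random sample within each stratum (the definition above only says that a sample is drawn from each subgroup).

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: member IDs 1..100, split into 4 groups of 25.
population = list(range(1, 101))
group_of = {member: (member - 1) // 25 for member in population}

def simple_random_sample(members, n):
    # Every possible sample of size n has the same chance of being chosen.
    return rng.sample(members, n)

def systematic_sample(members, k):
    # Pick a starting point (here: at random), then take every kth element.
    start = rng.randrange(k)
    return members[start::k]

def stratified_sample(members, strata, n_per_stratum):
    # Subdivide into strata, then draw a sample from each stratum.
    by_stratum = {}
    for member in members:
        by_stratum.setdefault(strata[member], []).append(member)
    chosen = []
    for group in by_stratum.values():
        chosen.extend(rng.sample(group, n_per_stratum))
    return chosen

def cluster_sample(members, clusters, n_clusters):
    # Randomly select whole clusters, then keep all members of those clusters.
    cluster_ids = sorted(set(clusters[member] for member in members))
    picked = set(rng.sample(cluster_ids, n_clusters))
    return [member for member in members if clusters[member] in picked]

srs = simple_random_sample(population, 10)
syst = systematic_sample(population, 10)
strat = stratified_sample(population, group_of, 5)
clust = cluster_sample(population, group_of, 2)

print("simple random:", sorted(srs))
print("systematic:   ", syst)
print("stratified:   ", sorted(strat))
print("cluster size: ", len(clust))
```

Note the contrast between the last two: the stratified sample takes a few members from every group, while the cluster sample takes every member from a few groups.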

Sources of Bias in Sampling

  • Sampling bias: the technique used to obtain the individuals to be in the sample tends to favor one part of the population over another.
  • Nonresponse bias: when individuals selected to be in the sample who do not respond to a survey have different opinions from those who do.
  • Response bias: when answers on a survey do not reflect the true feelings of the respondent.
    • Interview error: a trained interviewer is essential to obtain accurate information. They will have the skill necessary to elicit responses and make the interviewee feel comfortable.
    • Misrepresented Answers: some survey questions result in responses that misrepresent facts or are flat-out lies.
    • Loaded Questions: The wording and presentation of questions play a large role in the type of response given to the question. The way a question is worded can lead to response bias, so questions must always be asked in a balanced form.
    • Ordering of Questions/Words: Questions can be unintentionally loaded by the order of items being considered. Many surveys rearrange the order of the questions within a questionnaire so that responses are not affected by prior questions.
    • Data-entry error: although not technically a form of response bias, data-entry errors will lead to results that are not representative of the population.

Common Problems

  • Problems with samples: A sample must be representative of the population. A sample that is not representative of the population is biased.
  • Self-selected samples: Responses only by people who choose to respond, such as call-in surveys, are often unreliable.
  • Sample size issues: Samples that are too small may be unreliable. Larger samples are better, if possible. In some situations, small samples are unavoidable and can still be used to draw conclusions.
  • Undue influence: collecting data or asking questions in a way that influences the response.
  • Non-response or refusal of participation: The collected responses may no longer be representative of the population. Often, people with strong positive or negative opinions may answer surveys, which can affect the results.
  • Causality: A relationship between two variables does not mean that one causes the other to occur. They may be related (correlated) because of their relationship through a different variable.
  • Misleading use of data: improperly displayed graphs, incomplete data, or lack of context.
  • Confounding: When the effects of multiple factors on a response cannot be separated.

1.3 Frequency, Frequency Tables, and Levels of Measurement

  • Frequency: The number of times a value of the data occurs.
  • Relative frequency: The ratio (fraction or proportion) of the number of times a value of the data occurs in the set of all outcomes to the total number of outcomes.
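  • Cumulative relative frequency: The accumulation of the previous relative frequencies; that is, the relative frequency of the current value added to the relative frequencies of all values before it.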
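
A frequency table combining all three quantities might be built as in the sketch below. The data values are hypothetical, and the row layout (value, frequency, relative frequency, cumulative relative frequency) is one reasonable choice rather than a required format.

```python
from collections import Counter

# Hypothetical data set of ten observations.
data = [2, 3, 3, 4, 4, 4, 5, 5, 6, 6]

counts = Counter(data)  # frequency of each distinct value
n = len(data)           # total number of outcomes

rows = []
cumulative = 0.0
for value in sorted(counts):
    frequency = counts[value]
    relative = frequency / n  # frequency divided by the total
    cumulative += relative    # running sum of relative frequencies
    rows.append((value, frequency, relative, cumulative))

print(f"{'value':>5} {'freq':>5} {'rel':>6} {'cum rel':>8}")
for value, frequency, relative, cum in rows:
    print(f"{value:>5} {frequency:>5} {relative:>6.2f} {cum:>8.2f}")
```

The last entry in the cumulative column should come out to 1 (up to rounding), which is a quick check that no observations were dropped.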

Levels of Measurement

  • Nominal scale level: data that cannot be ordered and cannot be used in calculations.
  • Ordinal scale level: data that can be ordered, but the differences between values cannot be measured.
  • Interval scale level: data with a definite ordering but no natural starting point (true zero); the differences can be measured, but ratios are not meaningful.
  • Ratio scale level: data that can be ordered and that has a natural starting point (true zero); the differences have meaning and ratios can be calculated.
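
The difference between the interval and ratio levels can be illustrated with temperature, used here as an assumed example that is not part of the definitions above. Celsius has an arbitrary zero (interval level), while Kelvin starts at a true zero (ratio level). The sketch below shows that differences agree across the two scales but ratios do not.

```python
def to_kelvin(celsius):
    # Kelvin has the same unit size as Celsius but starts at absolute zero.
    return celsius + 273.15

celsius_a, celsius_b = 10.0, 20.0

# Differences are meaningful at the interval level: both scales agree.
diff_c = celsius_b - celsius_a
diff_k = to_kelvin(celsius_b) - to_kelvin(celsius_a)

# Ratios depend on where zero sits, so only the ratio-level scale gives
# a meaningful answer.
ratio_c = celsius_b / celsius_a                        # 2.0, an artifact of the zero point
ratio_k = to_kelvin(celsius_b) / to_kelvin(celsius_a)  # about 1.035

print(f"difference: {diff_c:.2f} (Celsius) vs {diff_k:.2f} (Kelvin)")
print(f"ratio:      {ratio_c:.3f} (Celsius) vs {ratio_k:.3f} (Kelvin)")
```

So 20 degrees Celsius is 10 degrees warmer than 10 degrees Celsius, but it is not "twice as hot."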

1.4 Experimental Design and Ethics

  • Explanatory variable: The variable whose effect you want to study; the independent variable.
  • Response variable: the variable that you suspect is affected by the other variable; the dependent variable.
  • Experimental Unit: a single object or individual to be measured.
  • Placebo: A treatment that cannot influence the response variable.
  • Double-blinded experiment: one in which both the subjects and the researchers involved with the subjects are blinded; that is, neither group knows which subjects receive the active treatment and which receive the placebo.
  • Nonsampling Error: an issue, other than natural variation, that affects the reliability of sampling data.
  • Institutional Review Board: a committee tasked with oversight of research programs that involve human subjects.
