Chapter 1: Sampling and Data
Statistics: the science of planning studies and experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions based on the data.
Data: collections of observations.
Descriptive Statistics: organizing and summarizing data by graphing and by numerical values (such as an average).
Inferential Statistics: uses methods that take a result from a sample, extend it to the population, and measure the reliability of the result.
Probability: the chance of an event occurring.
Population: the complete collection of all individuals to be studied.
Sample: a subcollection of members selected from a population.
Sampling: selecting a portion (or subset) of the larger population and studying that portion (the sample) to gain information about the population. Data are the result of sampling from a population.
Parameter: a numerical measurement describing some characteristic of a population.
Statistic: a numerical measurement describing some characteristic of a sample.
Representative Sample: a sample that shares the characteristics of the population. One of the main concerns in the field of statistics is how accurately a statistic estimates a parameter.
Variable: a characteristic or measurement that can be determined for each member of a population.
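Mean: the sum of the values divided by the number of values; also called the “average.”
Proportion: the part out of the whole; the number of individuals with a given characteristic divided by the total number of individuals.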
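The distinction between a parameter and a statistic, using the mean and a proportion, can be sketched in Python. The population below is made-up data (hypothetical ages), and the cutoff of 65 is an arbitrary choice for illustration; this is only a sketch, not a prescribed method.

```python
import random
from statistics import mean

# Hypothetical population: ages of 1,000 individuals (made-up data).
rng = random.Random(0)
population = [rng.randint(18, 90) for _ in range(1000)]

# Parameters: numerical measurements describing the whole population.
population_mean = mean(population)
population_proportion = sum(age >= 65 for age in population) / len(population)

# Statistics: the same measurements computed on a sample of 50 members.
sample = rng.sample(population, 50)
sample_mean = mean(sample)
sample_proportion = sum(age >= 65 for age in sample) / len(sample)

print(f"mean:       parameter={population_mean:.2f}  statistic={sample_mean:.2f}")
print(f"proportion: parameter={population_proportion:.3f}  statistic={sample_proportion:.3f}")
```

The statistic will usually differ somewhat from the parameter; how closely it estimates the parameter is the concern described above.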
Quantitative (or numerical) data: data that consists of numbers representing counts or measurements.
Qualitative (or Categorical) data: data that consists of names or labels that are not numbers representing counts or measurements.
Discrete data: quantitative data which results when the number of possible values is either a finite number or a countable number.
Continuous data: quantitative data which results when there are infinitely many possible values corresponding to some continuous scale that covers a range of values without gaps, interruptions, or jumps.
Pie Chart: categories of data are represented by wedges in a circle, with each wedge proportional in size to the percentage of individuals in its category.
Bar Graph: the length of the bar for each category is proportional to the number or percent of individuals in each category.
Pareto chart: consists of bars that are sorted into order by category size (largest to smallest).
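The ordering that defines a Pareto chart can be illustrated with a short text-based sketch. The categories and counts below are hypothetical, and the bars are drawn with `#` characters rather than a plotting library.

```python
# Hypothetical complaint counts by category (made-up data).
counts = {"late delivery": 12, "damaged item": 30, "wrong item": 7, "billing": 18}

# A Pareto chart sorts the bars by category size, largest to smallest.
pareto_order = sorted(counts.items(), key=lambda item: item[1], reverse=True)

total = sum(counts.values())
for category, count in pareto_order:
    # Bar length is proportional to the number of individuals in the category.
    print(f"{category:15s} {'#' * count} {count / total:.0%}")
```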
Simple random sample: A sample of n subjects selected in such a way that every possible sample of the same size n has the same chance of being chosen.
Systematic sample: A sample in which the researcher selects some starting point and then selects every kth element in the population.
Stratified sample: A sample in which the researcher subdivides the population into at least two different subgroups (or strata), and then draws a sample from each subgroup.
Cluster sample: A sample in which the researcher first divides the population into sections (or clusters), then randomly selects some of those clusters, and then chooses all members from the selected clusters.
Convenience sample: A sample in which the researcher simply uses results that are very easy to get. This is not a valid sampling method and will likely result in biased data.
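The four random sampling methods defined above (simple random, systematic, stratified, and cluster) can be sketched with Python's standard `random` module. The population of member IDs 1 to 100, the sample sizes, the two strata, and the clusters of ten are all assumptions chosen for illustration.

```python
import random

rng = random.Random(42)
population = list(range(1, 101))  # hypothetical member IDs 1..100

# Simple random sample: every possible sample of size 10 is equally likely.
simple_random = rng.sample(population, 10)

# Systematic sample: pick a random starting point, then take every kth member.
k = 10
start = rng.randrange(k)
systematic = population[start::k]

# Stratified sample: split the population into strata, then sample from EACH stratum.
strata = {"low": population[:50], "high": population[50:]}
stratified = [m for group in strata.values() for m in rng.sample(group, 5)]

# Cluster sample: split into clusters, randomly choose SOME clusters,
# then keep ALL members of the chosen clusters.
clusters = [population[i:i + 10] for i in range(0, len(population), 10)]
chosen_clusters = rng.sample(clusters, 2)
cluster_sample = [m for cluster in chosen_clusters for m in cluster]

print("simple random:", sorted(simple_random))
print("systematic:   ", systematic)
print("stratified:   ", sorted(stratified))
print("cluster:      ", cluster_sample)
```

Note the contrast between the last two: a stratified sample takes some members from every subgroup, while a cluster sample takes every member from some subgroups.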
Bias: occurs when the results of the sample are not representative of the population.
Sampling bias: occurs when the technique used to obtain the individuals in the sample tends to favor one part of the population over another.
Nonresponse bias: occurs when individuals selected for the sample who do not respond to a survey have different opinions from those who do.
Response bias: when answers on a survey do not reflect the true feelings of the respondent.
Interview error: a trained interviewer is essential to obtain accurate information. They will have the skill necessary to elicit responses and make the interviewee feel comfortable.
Misrepresented Answers: some survey questions result in responses that misrepresent facts or are flat-out lies.
Loaded Questions: The wording and presentation of questions play a large role in the type of response given. The way a question is worded can lead to response bias, so questions must always be asked in a balanced form.
Ordering of Questions/Words: Questions can be unintentionally loaded by the order of items being considered. Many surveys rearrange the order of the questions within a questionnaire so that responses are not affected by prior questions.
Data-entry error: although not technically a form of response bias, data-entry errors lead to results that are not representative of the population.
Problems with samples: A sample must be representative of the population. A sample that is not representative of the population is biased.
Self-selected samples: Responses only by people who choose to respond, such as call-in surveys, are often unreliable.
Sample size issues: Samples that are too small may be unreliable. Larger samples are better, if possible. In some situations, small samples are unavoidable and can still be used to draw conclusions.
Undue influence: collecting data or asking questions in a way that influences the response.
Non-response or refusal of participation: The collected responses may no longer be representative of the population. Often, people with strong positive or negative opinions may answer surveys, which can affect the results.
Causality: A relationship between two variables does not mean that one causes the other to occur. They may be related (correlated) because of their relationship through a different variable.
Misleading use of data: improperly displayed graphs, incomplete data, or lack of context.
Confounding: When the effects of multiple factors on a response cannot be separated.
Frequency: The number of times a value of the data occurs.
Relative frequency: The ratio (fraction or proportion) of the number of times a value of the data occurs in the set of all outcomes to the total number of outcomes.
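Cumulative relative frequency: The accumulation of the previous relative frequencies; that is, the relative frequency of a value added to the relative frequencies of all values before it.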
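Frequency, relative frequency, and cumulative relative frequency can be computed together in a small table. The data values below are hypothetical, and the sketch uses only the Python standard library.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical data set of 10 observations (made-up values).
data = [2, 3, 3, 4, 4, 4, 5, 5, 6, 6]

frequency = Counter(data)  # number of times each value occurs
values = sorted(frequency)  # distinct values in increasing order
n = len(data)  # total number of outcomes

# Relative frequency: frequency of a value divided by the total number of outcomes.
relative = [frequency[v] / n for v in values]

# Cumulative relative frequency: running sum of the relative frequencies.
cumulative = list(accumulate(relative))

print("value  freq  rel.freq  cum.rel.freq")
for v, r, c in zip(values, relative, cumulative):
    print(f"{v:5d}  {frequency[v]:4d}  {r:8.2f}  {c:12.2f}")
```

The relative frequencies sum to 1, so the last cumulative relative frequency is 1 (up to floating-point rounding).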
Nominal scale level: data that can be neither ordered nor used in calculations.
Ordinal scale level: data that can be ordered, but the differences between values cannot be measured.
Interval scale level: data with a definite ordering but no true zero starting point; the differences can be measured, but ratios are not meaningful.
Ratio scale level: data that can be ordered and has a true zero starting point; the differences have meaning and ratios can be calculated.
Explanatory variable: The variable whose effect you want to study; the independent variable.
Response variable: the variable that you suspect is affected by the other variable; the dependent variable.
Experimental Unit: a single object or individual to be measured.
Placebo: A treatment that cannot influence the response variable.
Double-blinded experiment: one in which both the subjects and the researchers involved with the subjects are blinded.
Nonsampling Error: an issue, other than natural variation, that affects the reliability of sampling data.
Institutional Review Board: a committee tasked with oversight of research programs that involve human subjects.