Statistics: Informed Decisions Using Data, Seventh Edition - Chapter 1: Data Collection

Introduction to the Practice of Statistics

Statistics is defined as the science of collecting, organizing, summarizing, and analyzing information to draw conclusions or answer questions. Additionally, statistics provides a measure of confidence in any conclusions drawn.
Data are the information referred to in the definition of statistics. Data are a "fact or proposition used to draw a conclusion or make a decision." Data describe the characteristics of an individual.
A key aspect of data is that they vary. One goal of statistics is to describe and understand sources of variability among individuals and within measurements on a single individual (e.g., variation in height, hair color, or daily caloric intake).
The population of a study refers to the entire group of individuals to be studied.
An individual is a person or object that is a member of the population being studied.
A sample is a subset of the population that is being studied.
A statistic is a numerical summary of a sample.
A parameter is a numerical summary of a population.
Example: Parameter Versus Statistic:
- If $48.2\%$ of all students on a campus own a car, this value is a parameter because it summarizes a population.
- If a sample of $100$ students is obtained and $46\%$ own a car, this value is a statistic because it summarizes a sample.
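The distinction can be sketched in a few lines of Python. This is only an illustration: the campus of $1000$ students (of whom $482$ own a car) and the seed value are assumptions chosen to match the $48.2\%$ figure above, not data from the text.

```python
import random

# Hypothetical campus: 1000 students, 482 of whom own a car (48.2%).
# The population size and the seed are assumptions made for illustration.
population = [True] * 482 + [False] * 518

# Parameter: a numerical summary of the entire population.
parameter = sum(population) / len(population)

# Statistic: a numerical summary of a sample of 100 students.
rng = random.Random(2024)
sample = rng.sample(population, 100)
statistic = sum(sample) / len(sample)

print(f"parameter = {parameter:.3f}")
print(f"statistic = {statistic:.2f}")
```

The parameter is fixed for a given population, while the statistic changes whenever a different sample is drawn.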
Descriptive statistics consist of organizing and summarizing data. This is achieved through numerical summaries, tables, and graphs.
Inferential statistics uses methods that take a result from a sample, extend it to the population, and measure the reliability of that result.
The Process of Statistics involves four major steps:
- 1. Identify the research objective: Establish the question(s) to be answered and clearly identify the population to be studied.
- 2. Collect the data needed: This typically involves a sample, because studying the entire population is often difficult and expensive. Data must be collected correctly to ensure conclusions are meaningful.
- 3. Describe the data: Use descriptive statistics to obtain an overview and determine the appropriate statistical methods for further analysis.
- 4. Perform inference: Apply techniques to extend sample results to the population and report a level of reliability.
Variables are the characteristics of the individuals within the population. Variables vary among different individuals.
Qualitative (Categorical) variables allow for the classification of individuals based on some attribute or characteristic.
Quantitative variables provide numerical measures of individuals. The values of quantitative variables can be added or subtracted to provide meaningful results.
Discrete variables are quantitative variables that have either a finite number of possible values or a countable number of possible values (e.g., $0, 1, 2, 3$). They cannot take on every possible value between any two values.
Continuous variables are quantitative variables that have an infinite number of possible values that are not countable. They may take on every possible value between any two values.
Types of Data:
- Qualitative data are observations corresponding to a qualitative variable.
- Quantitative data are observations corresponding to a quantitative variable.
- Discrete data are observations corresponding to a discrete variable.
- Continuous data are observations corresponding to a continuous variable.
Levels of Measurement of a Variable:
- Nominal level: The values of the variable name, label, or categorize. The naming scheme does not allow for the values to be arranged in a ranked or specific order (e.g., Race).
- Ordinal level: Similar to nominal, but the naming scheme allows for values to be arranged in a ranked or specific order (e.g., Letter grades).
- Interval level: Similar to ordinal, but differences in the values have meaning. A value of zero does not mean the absence of the quantity. Addition and subtraction can be performed (e.g., Temperature in degrees Celsius or Fahrenheit).
- Ratio level: Similar to interval, but the ratios of the values have meaning. A value of zero means the absence of the quantity. Multiplication and division can be performed (e.g., Number of days studied).
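A short Python sketch may help separate the interval and ratio levels. The specific readings ($10$ and $20$ degrees, $10$ and $20$ days) are made-up values for illustration only.

```python
def celsius_to_fahrenheit(c):
    """Convert a Celsius reading to Fahrenheit."""
    return c * 9 / 5 + 32

# Hypothetical temperature readings (interval level).
low_c, high_c = 10.0, 20.0
low_f, high_f = celsius_to_fahrenheit(low_c), celsius_to_fahrenheit(high_c)

# Differences are meaningful: a 10-degree Celsius gap is always
# an 18-degree Fahrenheit gap, wherever it sits on the scale.
gap_c = high_c - low_c
gap_f = high_f - low_f

# Ratios are not meaningful: "twice as hot" depends on the unit chosen.
ratio_c = high_c / low_c
ratio_f = high_f / low_f

# Ratio level: 20 days studied is twice 10 days, whether counted
# in days or in hours, because zero means "no studying at all".
ratio_days = 20 / 10
ratio_hours = (20 * 24) / (10 * 24)

print(gap_c, gap_f)             # 10.0 18.0
print(ratio_c, ratio_f)         # 2.0 1.36
print(ratio_days, ratio_hours)  # 2.0 2.0
```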

Observational Studies Versus Designed Experiments

In research, the response variable is the outcome being measured, while the explanatory variable is the factor being varied to determine its effect on the response.
An observational study measures the value of the response variable without attempting to influence the value of either the response or explanatory variables. The researcher observes behavior without trying to influence the outcome.
A designed experiment occurs if a researcher assigns individuals in a study to groups, intentionally manipulates the value of an explanatory variable, and records the value of the response variable for each individual.
Confounding in a study occurs when the effects of two or more explanatory variables are not separated. Relations between an explanatory variable and a response may actually be due to other variables not accounted for.
A lurking variable is an explanatory variable that was not considered in a study but affects the value of the response variable. Lurking variables are typically related to the explanatory variables that were considered.
A confounding variable is an explanatory variable that was considered in a study whose effect cannot be distinguished from a second explanatory variable in the study.
Observational studies do not allow a researcher to claim causation, only association.
Types of Observational Studies:
- Cross-sectional Studies: Collect information about individuals at a specific point in time or over a very short period.
- Case-control Studies: Retrospective studies requiring individuals to look back in time or researchers to look at existing records. Individuals with certain characteristics are matched with those that do not have them.
- Cohort Studies: Prospective studies where a group (the cohort) is identified and then observed over a long period. Characteristics are recorded over time.
A census is a list of all individuals in a population along with certain characteristics of each individual.
Web scraping (data mining) is the process of extracting data from the Internet, which can involve unstructured data transformed through parsing and reformatting. This practice raises ethical issues regarding permission.

Simple Random Sampling

Random sampling is the process of using chance to select individuals from a population for a sample. Letting chance, rather than convenience, determine who is selected is the key to obtaining a sample that is representative of the population. Results based on convenience sampling are unreliable.
Simple Random Sampling: A sample of size $n$ from a population of size $N$ is obtained through simple random sampling if every possible sample of size $n$ is equally likely to be selected.
A frame is a list of all the individuals within the population, required to assign unique numbers for sampling.
Sampling with replacement: A selected individual is placed back into the population and can be chosen again.
Sampling without replacement: A selected individual is removed from the population and cannot be chosen again.
Methods for obtaining a simple random sample include using a table of random numbers or using technology (graphing calculators or software).
For technology, a seed is an initial point for the generator to start creating random numbers.
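As a minimal sketch of the technology route, Python's standard `random` module can draw a simple random sample from a numbered frame. The frame of $30$ individuals, the sample size, and the seed are hypothetical values chosen for illustration.

```python
import random

# Hypothetical frame: 30 individuals, each assigned a unique number 1..30.
frame = list(range(1, 31))

def simple_random_sample(frame, n, seed):
    """Sample n individuals without replacement.

    random.Random.sample makes every subset of size n equally likely.
    """
    rng = random.Random(seed)  # the seed fixes the generator's starting point
    return rng.sample(frame, n)

def sample_with_replacement(frame, n, seed):
    """Sample n individuals with replacement; repeats are possible."""
    rng = random.Random(seed)
    return rng.choices(frame, k=n)

srs = simple_random_sample(frame, 5, seed=17)
same_seed = simple_random_sample(frame, 5, seed=17)
with_repl = sample_with_replacement(frame, 5, seed=17)

print(srs)
print(with_repl)
```

Reusing the same seed reproduces the same sample, which is useful when results need to be checked by someone else.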

Other Effective Sampling Methods

Stratified Sample: Obtained by separating the population into nonoverlapping groups called strata and then obtaining a simple random sample from each stratum. Individuals within each stratum should be homogeneous (similar) in some way.
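A stratified sample can be sketched as follows. The strata (class years), their sizes, and the choice to take the same number of individuals from each stratum are assumptions for illustration; the notes do not specify how many to draw per stratum.

```python
import random

def stratified_sample(strata, n_per_stratum, seed):
    """Take a simple random sample of n_per_stratum from every stratum."""
    rng = random.Random(seed)
    return {
        name: rng.sample(members, n_per_stratum)
        for name, members in strata.items()
    }

# Hypothetical, nonoverlapping strata: students grouped by class year.
strata = {
    "freshman": [f"F{i}" for i in range(1, 41)],
    "sophomore": [f"S{i}" for i in range(1, 36)],
    "junior": [f"J{i}" for i in range(1, 31)],
}

chosen = stratified_sample(strata, n_per_stratum=4, seed=5)
for name, members in chosen.items():
    print(name, members)
```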
Systematic Sample: Obtained by selecting every $k^{th}$ individual from the population. The first individual selected corresponds to a random number $p$ between $1$ and $k$.
Steps in Systematic Sampling:
- 1. Approximate population size $N$
- 2. Determine desired sample size $n$
- 3. Compute $k = \frac{N}{n}$ and round down to the nearest integer.
- 4. Randomly select a number $p$ between $1$ and $k$.
- 5. The sample consists of individuals: $p, p+k, p+2k, \dots, p+(n-1)k$.
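The steps above can be sketched in Python. The population size, sample size, and seed in the usage lines are hypothetical, and the function assumes $n \le N$ so that $k \ge 1$.

```python
import random

def systematic_sample(N, n, seed):
    """Return the positions selected by a systematic sample (assumes n <= N)."""
    k = N // n                  # step 3: compute N/n and round down
    rng = random.Random(seed)
    p = rng.randint(1, k)       # step 4: random start between 1 and k, inclusive
    # Step 5: individuals p, p + k, p + 2k, ..., p + (n - 1)k
    return [p + i * k for i in range(n)]

# Hypothetical population of 1000 individuals, desired sample of 40, so k = 25.
positions = systematic_sample(N=1000, n=40, seed=8)
print(positions[:5], "...", positions[-1])
```

Because $k$ is rounded down, the last selected position $p+(n-1)k$ never exceeds $N$.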
Cluster Sample: Obtained by selecting all individuals within a randomly selected collection or group of individuals.
Stratified vs. Cluster Sampling:
- Stratified: Divide into homogeneous groups, then take a simple random sample from every group.
- Cluster: Divide into groups, then take a simple random sample of the groups themselves and survey every individual within the chosen groups.
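For contrast with stratified sampling, a cluster sample can be sketched like this. The clusters (city blocks with household labels) and the seed are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed):
    """Randomly select whole clusters, then keep every individual in them."""
    rng = random.Random(seed)
    # Simple random sample of the groups themselves, not of individuals.
    chosen = rng.sample(sorted(clusters), n_clusters)
    individuals = [person for name in chosen for person in clusters[name]]
    return chosen, individuals

# Hypothetical clusters: households grouped by city block.
clusters = {
    "block_A": ["A1", "A2", "A3"],
    "block_B": ["B1", "B2"],
    "block_C": ["C1", "C2", "C3", "C4"],
    "block_D": ["D1", "D2", "D3"],
}

chosen_blocks, surveyed = cluster_sample(clusters, n_clusters=2, seed=3)
print(chosen_blocks)
print(surveyed)
```

Here chance operates on the groups: an unselected block contributes no one, whereas a stratified sample draws from every group.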
Convenience Sample: Individuals are easily obtained and not based on randomness (e.g., self-selected or voluntary response samples). These results are suspect and should be viewed with extreme skepticism.
Multistage Sampling: The use of a combination of techniques. For example, Nielsen Media Research uses 2 stages:
- Stage 1: Stratification by geographic area (city blocks/rural regions) and random selection of about $6000$ strata.
- Stage 2: Simple random sampling of households within the selected strata.

Bias in Sampling

Bias occurs if results of the sample are not representative of the population.
Sampling Bias: The technique used favors one part of the population over another. It includes undercoverage, where the proportion of one segment of the population is lower in the sample than in the population.
Nonresponse Bias: Exists when individuals selected for the sample who do not respond have different opinions from those who do. Nonresponse can be reduced through callbacks or incentives.
Response Bias: Exists when answers on a survey do not reflect the true feelings of the respondent. Sources include:
- Interviewer error: Lack of skill in making the interviewee comfortable.
- Misrepresented answers: Respondents lying or misrepresenting facts.
- Wording of questions: Leading questions that are not in balanced form.
- Ordering of questions or words: Prior questions affecting the response of later ones.
- Type of question: Open questions (the respondent supplies their own response) vs. closed questions (the respondent chooses from a predetermined list).
- Data-entry error: Errors occurring during input (e.g., entering $39$ as $93$).
Nonsampling errors result from undercoverage, nonresponse bias, response bias, or data-entry error. These can occur even in a census.
Sampling error results from using a sample to estimate population information; it occurs because a sample provides incomplete information about the population.
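A small simulation can illustrate sampling error. The population below (a hypothetical group of $1000$ individuals in which $48.2\%$ have some characteristic), the sample size, and the seed are assumptions for illustration; no nonsampling error is modeled.

```python
import random

# Hypothetical population: 1 = has the characteristic, 0 = does not.
population = [1] * 482 + [0] * 518
parameter = sum(population) / len(population)

rng = random.Random(1)
estimates = []
for _ in range(5):
    # Each sample sees only 100 of the 1000 individuals.
    sample = rng.sample(population, 100)
    estimates.append(sum(sample) / len(sample))

# Sampling error: the gap between each sample estimate and the parameter.
errors = [estimate - parameter for estimate in estimates]

for estimate, error in zip(estimates, errors):
    print(f"estimate = {estimate:.2f}, sampling error = {error:+.3f}")
```

Even with perfectly collected data, the estimates differ from sample to sample because each sample carries incomplete information about the population.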

The Design of Experiments

An experiment is a controlled study conducted to determine the effect of varying one or more explanatory variables (factors) on a response variable.
A treatment is any combination of factor values.
The experimental unit (or subject) is the person, object, or item upon which a treatment is applied.
A control group serves as a baseline for comparison.
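A placebo is an innocuous medication (like a sugar pill) that resembles the experimental medication. It is given to the control group so that those individuals cannot tell which treatment they are receiving and therefore behave the same as the treatment group.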
Blinding refers to the nondisclosure of the treatment being received.
- Single-blind: The subject does not know which treatment they receive.
- Double-blind: Neither the subject nor the researcher in contact with them knows which treatment is being received.
Steps in Designing an Experiment:
- 1. Identify the Problem/Claim: Clearly state the question, the response variable, and the population.
- 2. Determine Factors: Identify factors affecting the response variable; decide which to fix, manipulate, or leave uncontrolled.
- 3. Determine Number of Units: Choose based on time and money constraints.
- 4. Determine Level of Factors: Use control (setting factors at specific levels) or randomization (assigning units randomly to groups to mute uncontrolled variables).
- 5. Conduct Experiment: Apply replication (assigning treatments to multiple units to ensure results are not unique to one unit). Collect and process data.
- 6. Test the Claim: Use inferential statistics to generalize sample results to the population with a certain level of confidence.
Experimental Designs:
- Completely Randomized Design: Each experimental unit is randomly assigned to a treatment.
- Matched-Pairs Design: Experimental units are paired (e.g., before/after, twins, spouses). There are only two levels of treatment.
- Randomized Block Design: Used when units are divided into homogeneous groups called blocks. Within each block, units are randomly assigned to treatments. Blocking reduces variability due to factors (like variety or age) so comparisons occur within the blocks.
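Random assignment for the first and third designs can be sketched as below. The unit labels, treatment names, blocks, and seeds are hypothetical, and dealing shuffled units out in turn is just one simple way to get balanced groups.

```python
import random

def completely_randomized(units, treatments, seed):
    """Shuffle all units, then deal them out to the treatments in turn."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    return {
        unit: treatments[i % len(treatments)]
        for i, unit in enumerate(shuffled)
    }

def randomized_block(blocks, treatments, seed):
    """Randomize separately within each homogeneous block."""
    rng = random.Random(seed)
    assignment = {}
    for block_name, units in blocks.items():
        shuffled = list(units)
        rng.shuffle(shuffled)
        for i, unit in enumerate(shuffled):
            assignment[unit] = (block_name, treatments[i % len(treatments)])
    return assignment

# Hypothetical experiment: 12 units and 3 treatments.
units = [f"unit{i}" for i in range(1, 13)]
treatments = ["control", "low dose", "high dose"]
crd = completely_randomized(units, treatments, seed=3)

# Hypothetical blocks: the same 12 units split by age group.
blocks = {"younger": units[:6], "older": units[6:]}
rbd = randomized_block(blocks, treatments, seed=3)

print(crd)
print(rbd)
```

In the blocked version every treatment appears equally often within each block, so treatment comparisons are made among similar units.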

Questions & Discussion

Research Example: Cellular Phones and Brain Tumors
- Objective: Determine association between cell phones and brain tumors.
- Study 1 (Observational): Followed $791{,}710$ women in the UK for $7$ years. $1261$ incidences of brain tumors were found, with no significant difference between cell phone users and non-users.
- Study 2 (Experimental): Because humans cannot ethically be exposed to radiation for research purposes, researchers used rats. $90$ rats were assigned to $3$ groups: Control (no radiation), Group 2 (GSM-modulated RFR), and Group 3 (CDMA-modulated RFR). Radiation exposure alternated $10$ minutes on and $10$ minutes off for $9$ hours daily over $2$ years. Results were not statistically significant.
Research Example: Flu Shots for Seniors
- Objective: Determine long-term benefit for seniors $65+$ using records of $36{,}000$ seniors over $10$ years.
- Findings: Seniors getting shots were $27\%$ less likely to be hospitalized and $48\%$ less likely to die from pneumonia or influenza.
- Flaw identified: Because this was an observational study, lurking variables such as health status or mobility may have influenced the results, so the findings show association rather than causation.
Research Example: Lipitor (Pfizer)
- Objective: Assess effect on cardiovascular disease in $2838$ subjects with type 2 diabetes.
- Design: Placebo-controlled, double-blind study. $1428$ received Lipitor ($10\,\text{mg}$) vs. $1410$ received placebo over $4$ years.
- Results: Lipitor group had $83$ major events (vs. $127$ placebo) and $61$ deaths (vs. $82$ placebo).
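Because the two groups differ slightly in size, the counts are easier to compare as proportions. The sketch below is descriptive only: it uses the counts reported above and does not test whether the differences are statistically significant, which would require inferential methods not covered in these notes.

```python
# Counts taken from the Lipitor example above.
groups = {
    "Lipitor": {"n": 1428, "major_events": 83, "deaths": 61},
    "placebo": {"n": 1410, "major_events": 127, "deaths": 82},
}

# The two group sizes should add up to the 2838 subjects in the study.
total_subjects = sum(g["n"] for g in groups.values())

# Proportion of each group with a major event and proportion who died.
rates = {
    name: (g["major_events"] / g["n"], g["deaths"] / g["n"])
    for name, g in groups.items()
}

print(total_subjects)  # 2838
for name, (event_rate, death_rate) in rates.items():
    print(f"{name}: major events {event_rate:.1%}, deaths {death_rate:.1%}")
# Lipitor: major events 5.8%, deaths 4.3%
# placebo: major events 9.0%, deaths 5.8%
```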