Chapter 1 - Introduction to Statistics

1-1 Statistical and Critical Thinking

The process of conducting a statistical study consists of "prepare, analyze, and conclude."
Preparation involves consideration of the context, the source of data, and sampling method.
Data: collections of observations (such as measurements, genders, or survey responses)
Statistics: the science of planning studies and experiments, obtaining data, and organizing, summarizing, presenting, analyzing, and interpreting those data and drawing conclusions based on them
Population: complete collection of all measurements/data that are being considered
Census: collection of data from every member of the population
Sample: a subcollection of members selected from a population
Prepare:
- Context: what the data describe or refer to
- Source of the data: where the data come from
- Sampling method: how the data were collected
- Voluntary response sample (or self-selected sample): one in which the respondents themselves decide whether to be included
- Examples include Internet polls, mail-in polls, and telephone call-in polls
- Voluntary response samples are typically seriously flawed due to the strong possibility of bias
Analyze:
- Graph and explore: plot appropriate graphs and explore the data
- Apply statistical methods: application of these methods to the data, typically using technology such as calculators or software
Conclude:
- Statistical significance: this is achieved in a study when we get a result that is very unlikely to occur by chance (typically, a probability of less than 5%)
- Practical significance: it is possible that some treatment or finding is effective, but common sense may suggest that the treatment or finding does not make enough of a difference to justify its use
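To make the 5% threshold concrete, here is a minimal sketch that computes how likely a result is under pure chance. The coin-toss scenario (16 or more heads in 20 tosses of a fair coin) and the helper name `prob_at_least` are assumptions chosen for illustration; they are not from the notes.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """Probability of k or more successes in n independent trials,
    each with success probability p (exact binomial tail)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical result: 16 heads in 20 tosses of a supposedly fair coin.
chance = prob_at_least(16, 20)

# Compare against the conventional 5% cutoff.
significant = chance < 0.05
print(f"P(16 or more heads by chance) = {chance:.4f}")
print(f"Statistically significant at 5%: {significant}")
```

A small probability only says the result is unlikely to be due to chance. Whether the effect is large enough to matter (practical significance) is a separate judgment that this calculation does not address.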
Analyzing Data: Potential Pitfalls
- Misleading conclusions: we should avoid making statements not justified by the statistical analysis. CORRELATION DOES NOT IMPLY CAUSATION.
- Sample data reported instead of measured: it is better to take measurements yourself instead of asking subjects to report results. For example, when asked how much they weigh, people may tend to report a number that is closer to their desired weight than to their actual weight.
- Loaded questions: survey questions may be intentionally worded to elicit a desired response
- Order of questions: survey questions can be unintentionally loaded depending on the order of the items being considered
- Nonresponse: when someone either refuses to respond to a survey question or is unavailable
- Percentages: some studies cite misleading or unclear percentages

1-2 Types of Data

Parameter: a numerical measurement describing some characteristic of a POPULATION
Statistic: a numerical measurement describing some characteristic of a SAMPLE
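The difference between a parameter and a statistic can be sketched as follows. The population here (1,000 randomly generated ages) is entirely hypothetical and exists only to show which number describes which group.

```python
import random
from statistics import mean

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: ages of all 1,000 members of a club.
population = [rng.randint(18, 80) for _ in range(1000)]

# Parameter: computed from every member of the POPULATION.
parameter = mean(population)

# Statistic: computed from a SAMPLE of 50 members.
sample = rng.sample(population, 50)
statistic = mean(sample)

print(f"parameter (population mean age): {parameter:.2f}")
print(f"statistic (sample mean age):     {statistic:.2f}")
```

In practice the parameter is usually unknown, and the statistic is used to estimate it.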
Quantitative (or numerical) data consists of numbers representing counts or measurements
Categorical (or qualitative / attribute) data consists of names or labels
It is important to include appropriate units of measurement (such as $, ft, m).
Discrete data result when the data values are quantitative and the number of values is finite, or countable (for example, number of tosses of a coin before getting tails).
Continuous (numerical) data result from infinitely many possible quantitative values where the collection of values is not countable (for example, lengths anywhere from 0 cm to 12 cm).
Levels of measurement are important because they tell us which computations and statistical methods are appropriate for that type of data.
- Nominal: characterized by data that consist of names, labels, or categories only. These data cannot be arranged in any order (such as low to high). An example is a survey whose only responses are yes, no, and undecided.
- Ordinal: when the data can be arranged in some order, but differences (obtained through subtraction) between data values either cannot be determined or are meaningless. Examples include course grades (A, B, C, D, F).
- Interval: when the data can be arranged in order, and differences between data values ARE meaningful. Data at this level do not have a natural 0 starting point at which none of the quantity is present. Examples include temperatures in degrees Celsius or Fahrenheit and calendar years.
- Ratio: when data can be arranged in order, differences can be found and are meaningful, and there is a natural 0 starting point. Both differences and ratios are meaningful. Examples include heights of students or class times.
- To distinguish between interval and ratio levels of measurement, ask whether there is a "true zero" value and whether it makes sense to say that one value is "twice" another.
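The "twice" test can be checked with a short calculation. This is a minimal sketch; the specific temperatures and heights are made-up values, and the Kelvin conversion is included only because Kelvin has a true zero.

```python
def celsius_to_kelvin(c):
    """Convert degrees Celsius to kelvins (a scale with a true zero)."""
    return c + 273.15

# Interval level: 0 degrees Celsius does not mean "no heat",
# so dividing Celsius values is not meaningful.
warm_c, cool_c = 40.0, 20.0
naive_ratio = warm_c / cool_c  # 2.0, but 40 C is NOT "twice as hot" as 20 C
kelvin_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

# Ratio level: 0 cm means "no height", so ratios are meaningful.
tall_cm, short_cm = 180.0, 90.0
height_ratio = tall_cm / short_cm  # 2.0, and 180 cm really is twice 90 cm

print(f"Celsius ratio (misleading): {naive_ratio:.2f}")
print(f"Kelvin ratio (meaningful):  {kelvin_ratio:.2f}")
print(f"Height ratio (meaningful):  {height_ratio:.2f}")
```

The difference 40 - 20 = 20 degrees is meaningful on either temperature scale, which is what places Celsius at the interval level rather than the ordinal level.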
Big data: data sets so large and so complex that their analysis is beyond the capabilities of traditional software tools.
Data science: an area of study that involves applications of statistics, computer science, software engineering, and some other relevant fields.
A data value is missing completely at random if the likelihood of its being missing is independent of its value or any of the other values in the data set.
A data value is missing not at random if the missing value is related to the reason that it is missing.
To correct for missing data:
- Delete cases (delete all subjects having any missing values)
- Impute missing values (substitute values for those missing values)
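The two corrections can be sketched in plain Python. The records below are hypothetical, `None` is used to mark a missing value, and mean imputation is just one simple choice of substitute value; the notes do not prescribe a particular imputation method.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"age": 25, "weight_kg": 70.0},
    {"age": None, "weight_kg": 82.5},
    {"age": 31, "weight_kg": None},
    {"age": 40, "weight_kg": 64.0},
]

# Option 1: delete cases (drop every subject with any missing value).
complete_cases = [r for r in records if None not in r.values()]

# Option 2: impute (here, substitute the mean of the observed values).
def impute_mean(rows, key):
    """Return copies of rows with missing values of `key` replaced by the observed mean."""
    observed = [r[key] for r in rows if r[key] is not None]
    fill = mean(observed)
    return [{**r, key: fill if r[key] is None else r[key]} for r in rows]

imputed = impute_mean(impute_mean(records, "age"), "weight_kg")

print(f"complete cases kept: {len(complete_cases)} of {len(records)}")
print(f"rows after imputation: {len(imputed)}")
```

Deleting cases shrinks the data set, while imputation keeps every subject at the cost of introducing values that were never observed.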

1-3 Collecting Sample Data

Randomization with placebo and treatment groups is sometimes called "the gold standard" because it is so effective.
Experiment: we apply some treatment and then proceed to observe its effects on the individuals (the individuals are called experimental units, and they are often called subjects when they are people).
Observational study: we observe and measure specific characteristics, but we don't attempt to modify the individuals being studied.
Lurking variable: a variable that affects the variables included in the study, but is not included in the study.
Good design of experiments includes replication, blinding, and randomization.
- Replication: the repetition of an experiment on more than 1 individual.
- Blinding: when the subject doesn't know whether he/she is receiving a treatment or a placebo. This is a way to get around the placebo effect, which occurs when an untreated subject reports an improvement in symptoms. A double-blind experiment is one in which neither the subject nor the doctor knows whether the subject is receiving the treatment or a placebo.
- Randomization: when individuals are assigned to different groups through a process of random selection.
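Randomization into a treatment group and a placebo group might look like the sketch below. The subject labels, the equal split, and the function name `randomize` are assumptions made for illustration.

```python
import random

def randomize(subjects, seed=None):
    """Shuffle the subjects and split them into two groups (treatment, placebo)."""
    rng = random.Random(seed)
    shuffled = list(subjects)  # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical subject IDs S01..S20.
subjects = [f"S{i:02d}" for i in range(1, 21)]
treatment, placebo = randomize(subjects, seed=42)

print("treatment:", sorted(treatment))
print("placebo:  ", sorted(placebo))
```

Because chance alone decides who lands in which group, characteristics such as age or gender tend to balance out across the groups instead of being chosen by the researcher.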
Simple random sample: when n subjects are selected in such a way that every possible sample of the same size n has the same chance of being chosen.
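A minimal sketch of drawing a simple random sample with Python's standard library; the population of ID numbers 1 through 100 is hypothetical.

```python
import random

# Hypothetical population: ID numbers 1 through 100.
population = list(range(1, 101))

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# sample() draws without replacement, and every possible
# group of 10 IDs has the same chance of being chosen.
srs = rng.sample(population, 10)
print(sorted(srs))
```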
Systematic sampling: we select some starting point and then select every kth element in the population.
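Systematic sampling can be sketched as follows. Choosing the starting point at random among the first k elements is an assumption of this sketch; the notes only say to select "some starting point".

```python
import random

def systematic_sample(population, k, seed=None):
    """Pick a starting position among the first k elements, then take every kth element."""
    start = random.Random(seed).randrange(k)  # index 0 .. k-1
    return population[start::k]

# Hypothetical population: ID numbers 1 through 100.
population = list(range(1, 101))
sample = systematic_sample(population, k=10, seed=3)
print(sample)
```

With 100 elements and k = 10, the result always contains 10 IDs spaced exactly 10 apart.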
Convenience sampling: we simply use data that are very easy to get.
Stratified sampling: we subdivide the population into at least 2 different subgroups (or strata) so that subjects within the same subgroup share the same characteristics (such as gender). Then we draw a sample from each subgroup (or stratum).
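A minimal sketch of stratified sampling, assuming a hypothetical population of students labeled by class year and an equal sample size from each stratum (the notes do not specify how many to draw per stratum).

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, n_per_stratum, seed=None):
    """Group units into strata, then draw a random sample from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

# Hypothetical population: (class year, student number) pairs.
students = [("freshman", i) for i in range(40)] + [("senior", i) for i in range(25)]

sample = stratified_sample(students, stratum_of=lambda s: s[0],
                           n_per_stratum=5, seed=11)
for stratum, members in sample.items():
    print(stratum, members)
```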
Cross-sectional study: data are observed, measured, and collected at 1 point in time, not over a period of time.
Retrospective (or case-control) study: data are collected from a past time period by going back in time (by examining records, interviews, etc.)
Prospective (or longitudinal/cohort) study: data are collected in the future from groups that share common factors (known as cohorts).
Confounding occurs when we see some effect, but we can't identify the specific factor that caused it. For example, if a treatment group consists of all women and a placebo group consists of all men, confounding occurs because we cannot tell if the changes are due to gender or the treatment.
Completely Randomized Experimental Design: Assign subjects to different treatment groups through a process of random selection.
Randomized Block Design: A block is a group of subjects that are similar, but blocks differ in ways that might affect the outcome of the experiment. Subjects are randomly assigned to treatments separately within each block.
A randomized block design uses the same basic idea as stratified sampling, but randomized block designs are used for EXPERIMENTS whereas stratified sampling is used for SURVEYS.
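A randomized block design can be sketched by shuffling within each block and then dealing out the treatments. The blocks (women and men), the subject labels, and the function name are hypothetical, and alternating treatments after a shuffle is just one simple way to randomize within a block.

```python
import random

def randomized_block_assignment(blocks, treatments, seed=None):
    """Randomly assign treatments separately within each block.

    Returns a dict mapping subject -> (block name, treatment).
    """
    rng = random.Random(seed)
    assignment = {}
    for block_name, subjects in blocks.items():
        shuffled = list(subjects)
        rng.shuffle(shuffled)  # randomization happens inside the block
        for i, subject in enumerate(shuffled):
            assignment[subject] = (block_name, treatments[i % len(treatments)])
    return assignment

# Hypothetical blocks of similar subjects.
blocks = {
    "women": ["W1", "W2", "W3", "W4"],
    "men": ["M1", "M2", "M3", "M4"],
}
assignment = randomized_block_assignment(blocks, ["treatment", "placebo"], seed=5)
for subject, (block, treatment) in sorted(assignment.items()):
    print(subject, block, treatment)
```

Each block ends up with the same number of subjects in each treatment, so the blocking variable cannot be confounded with the treatment.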
Matched Pairs Design: Compare 2 treatment groups by using subjects that are matched in pairs that are somehow related or have similar characteristics.
Rigorously Controlled Design: Carefully assign subjects to different treatment groups, so that those given each treatment are similar in the ways that are important to the experiment.
It is possible to use a good sampling method and do everything correctly, and still get wrong results.
Sampling error (or random sampling error): this occurs when the sample has been selected with a random method, but there is a discrepancy between the sample result and the true population result; such an error results from chance sample fluctuations.
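Sampling error can be seen by drawing several random samples from the same population and comparing each sample mean with the population mean. The population below (10,000 simulated heights in cm) is hypothetical.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 heights in cm.
population = [rng.gauss(170, 10) for _ in range(10_000)]
true_mean = mean(population)

# Draw five simple random samples of size 100 and record each one's error.
errors = [mean(rng.sample(population, 100)) - true_mean for _ in range(5)]

for i, err in enumerate(errors, start=1):
    print(f"sample {i}: sample mean - population mean = {err:+.2f} cm")
```

Every sample is drawn correctly, yet each sample mean misses the population mean by a different small amount. That discrepancy is sampling error, not a mistake in the method.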
Nonsampling error: the result of human error, such as wrong data entries, computing errors, false data provided by respondents, etc.
Nonrandom sampling error: the result of using a sampling method that is not random (such as a convenience or voluntary response sample).