population
the entire group of individuals that is the target of our interest; generally too big to actually measure or observe
sample
subgroup of the population which we can examine or observe, measure and collect data from
individual
single entity that is being observed
variable
characteristic measured on each individual
quantitative variable
variable whose possible values are meaningful numbers
categorical variable
variable whose possible responses are non-quantitative categories (words/labels/attributes)
measurement
value of a variable for an individual
data
measurements for a set of individuals (Goal of Statistics: convert this to useful information)
data set
data identified with contextual information (who was observed, what was measured, why the study was done), often given in a table
EDA (exploratory data analysis) goals
organize and summarize data
discover features, patterns and striking deviations
interpret patterns in context
include visual displays and numerical values
single variable pattern
distribution of a variable: summary of data one variable at a time (all the possible values and how often they occur)
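As a small illustration of a distribution (each possible value and how often it occurs), the sketch below tabulates counts and proportions for one variable. The eye color data and the helper name `distribution` are made up for this example.

```python
from collections import Counter

# Hypothetical measurements of one categorical variable for 10 individuals.
eye_colors = ["brown", "blue", "brown", "green", "brown",
              "blue", "brown", "hazel", "blue", "brown"]

def distribution(values):
    """Return {value: (count, proportion)} for a single variable."""
    counts = Counter(values)
    n = len(values)
    return {value: (count, count / n) for value, count in counts.items()}

dist = distribution(eye_colors)
for value, (count, proportion) in sorted(dist.items()):
    print(f"{value:<6} count={count} proportion={proportion:.2f}")
```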
process of statistical problem solving
Collect data
Summarize data
Interpret data
parameter
numerical fact about the variable in the population
statistic
numerical fact about the variable in the sample
convenience sampling
select individuals in the easiest possible way
volunteer response sampling
individuals select themselves
quota sampling
force the sample to meet specified quotas
simple random sample (SRS)
every possible set of a specified size has an equal chance of being selected
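A minimal sketch of drawing an SRS, assuming the population can be listed (here a hypothetical roster of 100 ID numbers). Python's `random.sample` selects without replacement, so every set of the requested size is equally likely; the function name and the seed are illustrative only.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct individuals; every set of size n is equally likely."""
    rng = random.Random(seed)  # a seed only makes the example repeatable
    return rng.sample(population, n)

# Hypothetical population: ID numbers 1 through 100.
population = list(range(1, 101))
sample = simple_random_sample(population, 10, seed=1)
print(sorted(sample))
```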
cluster sampling
a random sample of clusters is taken and all individuals in selected clusters are included in sample
stratified random sample
select a random sample (SRS) from each stratum and combine these SRSs together
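The difference between cluster and stratified sampling can be sketched as follows. The dorm roster, function names, and sample sizes are hypothetical; the same groups are used as strata in one case and as clusters in the other only to make the contrast visible.

```python
import random

def stratified_sample(strata, n_per_stratum, rng):
    """Take an SRS of n_per_stratum from EVERY stratum and combine them."""
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose some clusters and keep ALL individuals in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = [person for name in chosen for person in clusters[name]]
    return chosen, sample

# Hypothetical population: four dorms with four residents each.
dorms = {name: [f"{name[0]}{i}" for i in range(1, 5)]
         for name in ("North", "South", "East", "West")}

rng = random.Random(7)
stratified = stratified_sample(dorms, 2, rng)             # 2 people from each dorm
chosen_dorms, clustered = cluster_sample(dorms, 2, rng)   # everyone in 2 dorms
print(stratified)
print(chosen_dorms, clustered)
```

Both samples contain eight people here, but the stratified sample represents every dorm while the cluster sample represents only the dorms that were drawn.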
multi-stage sample
take a sample at each hierarchical level of the population
treatment
the condition applied to a subject in an experiment (one of the subcategories/values of the explanatory variable)
lurking variables
variables that affect both the explanatory and response variables but are not measured or included as a planned factor in the study
control
an effort to reduce the effects of lurking variables
confounding
situation in which effects of lurking variables cannot be distinguished from effects of factors
historical comparison experiments
study involving only one treatment, where treated subjects are compared to untreated subjects from some external source
unreplicated experiments
only one subject is assigned to each treatment
confounded experiments
treatment groups are handled differently in some way OTHER than the treatment
undercoverage
some individuals have no possibility of being selected
non-response
some selected individuals choose not to be in the sample because they refuse to provide information or cannot be contacted
misleading response
people lie or give inaccurate answers (often about sensitive issues)
interviewer effect
person asking questions influences responses (for in-person/phone surveys)
question order effect
the order that questions are asked promotes certain responses
question wording
the way a question is asked leads, misleads, or confuses respondents
open questions
allow for almost unlimited possible responses (short answer), less restrictive but more difficult to analyze
closed questions
limit response options (multiple choice); easier to analyze but may be biased by the options provided; should include an "other/unsure" option
observational studies
individuals are not assigned to treatments (they are self-selected), so causation cannot be concluded
experiment
study where individuals are assigned to treatments; causal conclusions are justified if the experiment is valid
subject
individual to which treatment is applied
response variable
characteristic measured on each subject; outcome of interest
explanatory variable
characteristic/measurement that is used to predict or explain changes in the response variable; variable we think could help us know about the response (measured earlier or more easily); independent variable
factor
planned explanatory variable
comparison
two or more groups; controls lurking variables by including comparison treatments
randomization
randomly assign subjects to groups; neutralizes effects of lurking variables by assigning subjects to treatments using a random device
replication
two or more subjects in each group; assign more than one subject to each treatment to detect important effects
double blinding
neither subjects nor the researchers in direct contact with the subjects know which treatment is received
placebo effect
favorable response of a human subject to a placebo because of trust in the medical provider or belief that the treatment will work
diagnostic bias
diagnosis of subjects is biased by preconceived notions about the effectiveness of the treatment (person administering treatments expects certain responses)
lack of realism
realism is compromised by the conditions of the study
Hawthorne effect
people in an experiment behave differently than they normally would because they know they are being studied; not like real life
non-compliance
subjects fail to submit to the assigned treatment or refuse to follow the protocol of the experiment
principles of data ethics
safety and well-being of the subjects must be protected
all individuals must give their informed consent before data are collected
individual data must be kept confidential
randomized controlled experiment
randomly assign subjects to treatments, grouped by treatment
randomized block design
randomly assign to treatments within blocks, grouped by treatment or by block
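A rough sketch of the two designs, assuming a shuffle-and-deal assignment scheme (one of several reasonable ways to randomize). The subject labels, treatment names, and blocks are hypothetical.

```python
import random

def randomize(subjects, treatments, rng):
    """Completely randomized design: shuffle subjects, then deal them out."""
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, subject in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(subject)
    return groups

def randomize_within_blocks(blocks, treatments, rng):
    """Randomized block design: repeat the randomization inside each block."""
    return {block: randomize(members, treatments, rng)
            for block, members in blocks.items()}

rng = random.Random(3)
treatments = ["drug", "placebo"]

# Hypothetical subjects, then the same subjects blocked by age group.
subjects = ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8"]
blocks = {"under 40": ["S1", "S2", "S3", "S4"],
          "40 and over": ["S5", "S6", "S7", "S8"]}

rct = randomize(subjects, treatments, rng)
rbd = randomize_within_blocks(blocks, treatments, rng)
print(rct)
print(rbd)
```

In the block design every block contributes the same number of subjects to each treatment, so the blocking variable cannot end up unevenly split between treatments.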
benefits of randomized block design (RBD)
removes confounding of lurking variables
reduces chance variation by removing variation associated with the blocking variable
yields more precise estimates of chance variation
matched pairs
two treatments; matched individuals or two measurements per subject
three principles of experiments in a matched pairs design
randomly assign two treatments to two individuals or randomize the order of treatment application to each individual
replication = number of pairs
compare the two treatments
analysis of distribution of quantitative data
always plot data first
look for an overall pattern and for striking deviations
look at shape, center, spread of distribution
add numerical summaries to supplement graph
if pattern is regular, use mathematical model to describe data
symmetric and bell shaped distribution examples
blood pressure, IQ, biological factors
symmetric and bell shaped distribution
mean, median, and mode are approximately equal
right skewed distribution
concentration of data on left, tail extends to the right; mean > median
right skewed distribution examples
salary, home price, children, economic variables
left skewed distribution
concentration of data on right and the tail on the left; median > mean
left skewed distribution examples
test scores, Olympic high jump
bimodal distribution
a distribution with two modes
bimodal distribution examples
speed limits, restaurant patrons
flat or uniform distribution
relatively equal across graph
flat or uniform distribution examples
rolling a die, day of the month born
center
typical, middle value; half of data to each side
spread
consistency/inconsistency of data; look for maximum and minimum
outliers
values that are far outside most of data
is data point miscoded?
unusual conditions?
should data point be excluded?
mode
most frequently occurring score, corresponds to a peak
median
the middle score in a distribution; half the scores are above it and half are below it
mean
center of gravity; the arithmetic average of a distribution, obtained by adding the scores and then dividing by the number of scores
mean vs median
construct graph to evaluate skewness and outliers
use median if distribution is markedly skewed or outliers are present
use mean if distribution is roughly symmetric
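To see why the median is preferred for skewed data, the sketch below uses Python's standard `statistics` module on made-up salaries (in thousands) with one very large value, plus a small made-up list of scores for the mode.

```python
import statistics

# Hypothetical right-skewed data: salaries in thousands, one extreme value.
salaries = [30, 32, 35, 38, 40, 42, 45, 200]

mean_salary = statistics.mean(salaries)      # 57.75, pulled up by the 200
median_salary = statistics.median(salaries)  # 39.0, the middle of the data

# Hypothetical scores with one repeated value.
scores = [70, 80, 80, 90]
mode_score = statistics.mode(scores)         # 80, the most frequent score

print(mean_salary, median_salary, mode_score)
```

Seven of the eight salaries are below the mean, so the median is the better description of a typical value here, matching the rule mean > median for right-skewed data.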
range
maximum - minimum
interquartile range (IQR)
the difference between the third and first quartiles (Q3 - Q1)
standard deviation
roughly the typical distance of values from the mean; the square root of the average squared deviation from the mean
first quartile (Q1)
a number for which 25% of the data is less than that number; same as the median of the data which are less than the overall median
second quartile (Q2)
median
third quartile (Q3)
a number for which 75% of the data is less than that number; same as the median of the part of the data which is greater than the median
5 number summary vs 2 number summary
use the 5 number summary for skewed distributions, and the 2 number summary (mean and standard deviation) for symmetric distributions
5 number summary
minimum, Q1, median, Q3, maximum
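A minimal sketch of the 5 number summary using the quartile definitions above (Q1 and Q3 as the medians of the lower and upper halves). It assumes the halves are split by position, excluding the middle value when the count is odd; software packages often use slightly different quartile rules. The data are made up.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) via the median-of-halves rule."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]             # positions below the median
    upper = values[half + (n % 2):]   # positions above the median
    return (values[0],
            statistics.median(lower),
            statistics.median(values),
            statistics.median(upper),
            values[-1])

data = [2, 4, 4, 5, 7, 8, 9, 11, 12]   # hypothetical measurements
minimum, q1, median, q3, maximum = five_number_summary(data)

iqr = q3 - q1                    # interquartile range
data_range = maximum - minimum   # range
print((minimum, q1, median, q3, maximum), iqr, data_range)
```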
random phenomenon
individual outcome unpredictable, but outcomes from large number of repetitions follow regular pattern
sample space
the set of all possible outcomes
event
a collection of possible outcomes
probability of an outcome
The proportion of times that an outcome occurs in many, many repetitions of the random phenomenon
probability rules
0 ≤ P(A) ≤ 1 for any event A
the probabilities of all outcomes in the sample space sum to 1
if two events cannot occur simultaneously, the probability of one or the other equals the sum of separate probabilities
probability of event not occurring equals one minus the probability of event occurring
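The four rules can be checked on a small example. This sketch assumes one roll of a fair six-sided die, so every outcome is equally likely; the helper `prob` and the event names are illustrative.

```python
from fractions import Fraction

# Sample space for one roll of a fair die (assumed equally likely outcomes).
sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    """P(event) = favorable outcomes / total outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))

even = {2, 4, 6}
one = {1}          # cannot occur together with `even`

p_even = prob(even)                        # rule 1: between 0 and 1
p_all = prob(sample_space)                 # rule 2: everything sums to 1
p_even_or_one = prob(even | one)           # rule 3: equals P(even) + P(one)
p_not_even = prob(sample_space - even)     # rule 4: equals 1 - P(even)

print(p_even, p_all, p_even_or_one, p_not_even)
```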
theoretical probability
number of favorable outcomes divided by total number of possible outcomes
empirical probability
number of times the outcome occurred divided by the total number of repetitions
law of large numbers
As the number of repetitions of a probability experiment increases, the proportion with which a certain outcome is observed gets closer to the theoretical probability of the outcome
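The law of large numbers can be illustrated with a simulation. This sketch assumes a fair die simulated with Python's pseudorandom generator; the proportion of sixes is the empirical probability, and 1/6 is the theoretical probability. The seed and roll counts are arbitrary choices.

```python
import random

def proportion_of_sixes(n_rolls, rng):
    """Empirical probability of rolling a six in n_rolls simulated rolls."""
    sixes = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) == 6)
    return sixes / n_rolls

rng = random.Random(0)   # seeded only so the run is repeatable
theoretical = 1 / 6

for n in (10, 100, 1_000, 10_000, 100_000):
    p = proportion_of_sixes(n, rng)
    print(f"n={n:>6}  proportion={p:.4f}  distance from 1/6={abs(p - theoretical):.4f}")
```

The distance from 1/6 tends to shrink as n grows, although it does not have to shrink at every single step.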
probability
the long-run relative frequency with which an event will occur
probability distribution
all possible events and their associated probabilities
random variable
a variable whose value is a numerical outcome of a random phenomenon
continuous random variable
a variable that can take any value in an interval; its possible values cannot be listed
discrete random variable
variable whose possible values are a list of distinct values
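As an example of a discrete random variable and its probability distribution, the sketch below takes the sum of two fair dice (an assumed example, not from the original cards) and lists each possible value with its probability.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

# Random variable X = sum of the two dice; count how often each value occurs.
counts = Counter(a + b for a, b in outcomes)

# Probability distribution: each possible value of X with its probability.
distribution = {total: Fraction(count, len(outcomes))
                for total, count in sorted(counts.items())}

for total, p in distribution.items():
    print(total, p)
```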
𝜇
mean of a population
x-bar
mean of a sample
s
standard deviation of a sample
𝜎
standard deviation of a population
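The four symbols can be tied together in one sketch: 𝜇 and 𝜎 are parameters computed from the whole population, while x-bar and s are statistics computed from a sample. The tiny population below is hypothetical (real populations are usually too big to measure), and the variable names are illustrative.

```python
import random
import statistics

# Hypothetical population, small enough to measure completely.
population = [2, 4, 4, 4, 5, 5, 7, 9]

mu = statistics.mean(population)       # parameter: population mean
sigma = statistics.pstdev(population)  # parameter: population standard deviation

rng = random.Random(5)
sample = rng.sample(population, 4)     # simple random sample of 4 individuals

x_bar = statistics.mean(sample)        # statistic: sample mean
s = statistics.stdev(sample)           # statistic: sample standard deviation

print(f"mu={mu} sigma={sigma}  x_bar={x_bar} s={s:.3f}")
```

Note that `pstdev` divides by the population size while `stdev` divides by one less than the sample size, which is the usual convention for s.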