statistics
way to make sense of data
science that deals with collection, classification, analysis, and interpretation of numerical data
science of making decisions regarding the characteristics of observations based on information obtained from a randomly selected sample of a group
affect personal decision-making
probability
likelihood something will happen; outcome is uncertain
research involves
formulating a question of interest
designing a study
collecting data (statistics)
analyzing data (statistics)
interpreting results (statistics)
drawing conclusions (statistics)
anecdotal evidence
informal observations
ex. older men seem to gamble more than younger men
may be true but small samples that don’t represent entire population of interest
need formal study with statistical evidence about age and risk-taking behavior
good research question should state
groups of interest
response of interest
broad vs. focused question
population
set of all subjects of interest
sample
subset of population of interest on which you collect data
census
collecting data on everyone
difficulties with a census
expensive
more complex than taking a sample
difficult to complete (some individuals are hard to locate)
populations are dynamic and constantly changing
descriptive statistics
used to summarize collected data
inferential statistics
used to draw conclusions about a population, based on data obtained from a sample of population
parameter
numerical summary of the population
we want to make inferences on parameters
true value of parameter is unknown
denoted with Greek letters
statistic
numerical summary of the sample
calculate from sample data
denoted with lowercase letters, bars, and hats
sample data and statistical inference
sample data are an approximate (imperfect) reflection of population data
sample may not match population
statistical inference describes what is likely happening in the population based on observed sample data
must understand variation in data
variable
any characteristic observed in a study
categorical
each observation belongs to one of a set of categories
contain descriptive words/phrases
quantitative
observations take on numeric values
discrete
finite number of possible values
0, 1, 2, …
continuous
continuum of infinitely many possible values
ex. 1:54.2, 1:54.90
dichotomous
only 2 categories
dead or alive
nominal
two or more categories, but no intrinsic/natural ordering
ex. blood type
ordinal
categories have a natural ordering
ex. years in college
distribution
possible values a variable can take on and the occurrence of those values
frequency
number of observations in each category
proportion
number of observations in each category divided by the total number of observations (relative frequency)
percentage
proportion multiplied by 100
modal category
category with highest frequency
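These summaries can be sketched in a few lines of Python; the blood-type sample below is hypothetical:

```python
from collections import Counter

# hypothetical sample of a categorical variable (blood type)
data = ["O", "A", "A", "B", "O", "O", "AB", "A", "O", "B"]

counts = Counter(data)                                 # frequency of each category
n = len(data)
proportions = {k: v / n for k, v in counts.items()}    # relative frequency
percentages = {k: 100 * p for k, p in proportions.items()}
modal_category = counts.most_common(1)[0][0]           # category with highest frequency

print(counts["O"], proportions["O"], modal_category)   # 4 0.4 O
```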
data visualization: categorical variables
all data once must be sorted, cleaned, and arranged to see what is going on before we perform statistical analyses on it
graphs for categorical variables
pie chart
bar plot
dot plot
stem-and-leaf plot
histogram (summarizes quantitative data; does not display exact values)
time series plot
scatterplot
the graph you choose depends on
type of data you have
features of the data you want to highlight
we are unable to collect data from all men, so we need a strategy for determining who and how many men to collect data from to make conclusions about age and risk-taking behavior
sampling frame
list of subjects in the population from which the sample is taken
sampling design
method used to collect data
non-random sampling methods are likely to suffer from
bias
sampling
process of selecting units (cases; persons, objects, events)
probability sampling
units selected randomly; units have a known probability of being selected
simple random sampling
stratified sampling
cluster sampling
non-probability sampling
units selected non-randomly
volunteer sample; convenience sample
simple random sampling
every member has equal probability of being selected
ex. random number generator
issue: underrepresentation of a certain group in your population
solution: draw a larger sample to ensure representation
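A minimal sketch of a simple random sample, assuming a hypothetical sampling frame of 500 student IDs; the seed is fixed only so the draw is reproducible:

```python
import random

# hypothetical sampling frame: student ID numbers 1..500
frame = list(range(1, 501))

random.seed(7)                       # reproducible draw
sample = random.sample(frame, 25)    # SRS: every subject equally likely, no repeats

print(len(sample), len(set(sample)))  # 25 25
```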
stratified
divide sampling frame into strata (subpopulations) then randomly select from within each strata
cluster
sampling in stages
useful if target population is very large + you don’t really have a sampling frame
break group into clusters/natural groups → randomly select a group of clusters → obtain a sampling frame from each cluster → draw a random sample from each cluster
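The staged procedure above can be sketched as follows, with hypothetical schools as the clusters:

```python
import random

# hypothetical target population grouped into natural clusters (schools)
clusters = {f"school_{i}": [f"s{i}_{j}" for j in range(30)] for i in range(10)}

random.seed(1)
# stage 1: randomly select a group of clusters
chosen = random.sample(list(clusters), 3)

# stage 2: obtain a sampling frame within each chosen cluster,
# then draw a simple random sample from it
sample = []
for c in chosen:
    sample.extend(random.sample(clusters[c], 5))

print(len(sample))  # 3 clusters x 5 students = 15
```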
how do you select individuals to participate in your study?
ideally, you want participants to be representative sample from your population so that your statistical inference can be generalizable to the population
bias
present when the results of the sample are not representative of the population
sampling bias (coverage bias)
result from sampling method
sample not actually random
sampling frame does not represent entire population
nonresponse bias
occurs when people do not participate
participants may have different characteristics than non-participants
participants may only respond to some questions, generating missing data
response bias
occurs when participants give inaccurate answers
participants lie or misremember
questions can be confusing/misleading
nonprobability sampling
sometimes probability sample is difficult, not possible, or inappropriate for public health issues
ex. homeless populations are both hard to identify and not easily accessible
goal is to enhance insight or understanding of a small or specific social unit/group
if random sampling doesn’t make sense…use nonprobability sampling methods
bar plot
can use either the frequency or percent on y-axis
dot plot
horizontal line shows range of values for the variable of interest
each dot represents an observation
dot plots show exact data
stem-and-leaf plot
vertical line separates the stem from the leaf
the stem (left) shows all digits except the last one
the leaf (right) shows the last digit
stem-and-leaf plots show exact data values
histogram
useful to get a sense of the shape of the distribution of data
range of values for the variable of interest on x-axis
values are grouped into equal width intervals
frequency or relative frequency of occurrence for groups of values on y-axis
summarizes quantitative data; does not display exact values
too few intervals may not be informative or useful
too many intervals may make it too difficult to see trends
6-10 intervals is usually appropriate
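A sketch of how a histogram groups values into equal-width intervals and counts frequencies; the data and bin edges here are purely illustrative:

```python
# group quantitative data into equal-width intervals and count frequencies
data = [2, 3, 5, 7, 8, 11, 12, 13, 14, 18, 21, 22, 25, 27, 29]

low, high, k = 0, 30, 6                      # 6 equal-width intervals of width 5
width = (high - low) / k
counts = [0] * k
for x in data:
    i = min(int((x - low) // width), k - 1)  # last interval includes the maximum
    counts[i] += 1

print(counts)  # frequency in each interval: [2, 3, 4, 1, 2, 3]
```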
time series plot
displays data that are collected over time
x-axis as time
y-axis as variable of interest
trends are more easily identified when we connect the points with lines
scatterplots
useful to explore the relationship between two continuous variables
experimental
participants assigned to experimental conditions; response variable/outcome of interest is then observed
experimental conditions: treatments
establishes cause and effect
random assignment reduces the potential for confounding variables to affect results
typically has a control group and a treatment group
observational
researchers observe both the response and explanatory variables without assigning a “treatment”
non-experimental
cannot establish cause and effect
confounding variables can influence results
has a comparison group rather than a randomized control group
uses random sampling rather than random assignment
why randomly assign individuals to treatment and control groups?
allow us to make sure groups are balanced with respect to other characteristics
comparing results between groups allows us to determine if intervention was effective
these allow us to attribute any observed effects as the result of experimental assignment (rather than confounding variables); can conclude a causal effect
control
compare treatment of interest to control group
randomize
randomly assign subjects to treatment and control groups
replicate
collect a sufficiently large sample size or replicate the entire study
block
account for variables known or suspected to affect the response of interest
placebo
“fake” treatment, often used as control group
placebo effect
showing change despite being on the placebo
blinding
experimental units don’t know which group they are in
double-blind
both experimental units and researchers don’t know the group assignment
multifactorial experimental studies
categorical explanatory variables in experiments may be referred to as factors
sometimes it may be of interest to evaluate the effect of multiple factors
blocking in experimental studies
blocking creates groups (blocks) that are similar with respect to the blocking variable; then treatment is assigned
ex. separate participants into blocks by an existing characteristic (e.g., whether they already use the treatment being studied)
randomly assign the treatment or placebo within each block
factors vs. blocking
factors are conditions we can impose on the experimental units (explanatory variables; ex treatment vs. control)
blocking variables are characteristics that the experimental units come with, which we would like to control for
blocking in experimental studies is just like stratifying
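The blocking procedure described above can be sketched as blocked randomization; the participants and the blocking variable here are hypothetical:

```python
import random

# group participants by a blocking variable they arrive with (hypothetical
# prior-use status), then randomly assign treatment vs. placebo within each block
participants = {
    "prior_user": ["p1", "p2", "p3", "p4"],
    "non_user":   ["p5", "p6", "p7", "p8"],
}

random.seed(3)
assignment = {}
for block, members in participants.items():
    shuffled = members[:]
    random.shuffle(shuffled)          # randomize order within the block
    half = len(shuffled) // 2
    for person in shuffled[:half]:    # first half -> treatment
        assignment[person] = "treatment"
    for person in shuffled[half:]:    # second half -> placebo
        assignment[person] = "placebo"
```

Because assignment happens within each block, both groups end up balanced on the blocking variable by construction.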
designs of observational studies
cross-sectional
longitudinal
association does not imply causation - only examines associations
cross-sectional
one time point; a “snapshot”
longitudinal
same participants studied multiple times over time
confounding
occurs when a third variable is associated with both the explanatory and response variable
sometimes study results cannot be replicated because of
different sampling techniques (different kinds of people are enrolled in the study)
different explanatory variables are examined
poor data management/analysis
the finding was spurious to begin with
mean
sum of observations divided by the number of observations
highly influenced by outliers
median
middle value of ordered data
when odd, median is middle value of ordered data
when even, median is the average of the two middle data points
fairly resistant to outliers
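The outlier sensitivity of the mean versus the resistance of the median is easy to see with Python's statistics module on a small made-up dataset:

```python
import statistics

# the same data with and without one extreme value
data = [2, 3, 3, 4, 5]
with_outlier = data + [83]

print(statistics.mean(data), statistics.median(data))  # 3.4 3
# mean jumps to ~16.7; median only moves to 3.5
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```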
range
difference between the largest and smallest observations
range = max - min
severely affected by outliers
standard deviation
represents a type of average distance of an observation from the mean
quantifies variability observed in the data
has same unit of measurement as the original data
s is always greater than or equal to 0; when all observations take on the same value, s is 0
the larger the standard deviation, the greater the variability in the data
not resistant to outliers
variance
square of standard deviation
units of measurement for the variance is in the square units of measurement for the original data
we report the standard deviation more frequently than the variance in summary statistics
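A quick check of these properties using the statistics module (the sample, not population, versions) on made-up data:

```python
import statistics

data = [4, 8, 6, 5, 7]
s = statistics.stdev(data)       # sample standard deviation (same units as data)
var = statistics.variance(data)  # sample variance = s squared (squared units)

# constant data has no variability: s = 0
flat = [5, 5, 5, 5]
print(var, statistics.stdev(flat))  # 2.5 0.0
```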
histograms can be used to understand the
distribution of data and describe the overall pattern
unimodal
one peak
symmetric
mirror image when folded in half
mean and median are approximately equal
mean is an appropriate measure of central tendency
bell-shaped
follows a bell-shaped curve
bimodal
two peaks
left-skewed
left tail is longer than the right (skew is in the direction of the tail)
mean is less than the median
mean pulled in the direction of the long left tail
in highly skewed distribution, the median is preferred over the mean as a measure of central tendency (it better represents what is typical)
right-skewed
right tail is longer than the left (skew is in the direction of the tail)
mean is greater than the median
mean is pulled in the direction of the long right tail
in highly skewed distributions, the median is preferred over the mean as a measure of central tendency (it better represents what is typical)
uniform distribution
all values seem approximately equally likely
percentiles
value such that p percent of the observations fall below or at that value
median - 50th percentile
first quartile - 25th percentile
third quartile - 75th percentile
finding quartiles
order your data
identify middle of the data
examine the lower half of the data defined by the median. The median of the lower half is the 25th percentile (first quartile)
examine the upper half of the data defined by the median. The median of the upper half is the 75th percentile (third quartile)
interquartile range
distance between the third and first quartiles
IQR = Q3 - Q1
resistant to outliers
range of middle half of the data
potential outlier
below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR
possible for an observation to fall outside of these bounds and not truly be an outlier
five number summary
minimum value, Q1, median, Q3, maximum value
displayed in boxplot
whiskers extend out to the smallest and largest observations that are not potential outliers (indicated with circles)
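Putting the five-number summary and the 1.5 x IQR fences together on a small hypothetical dataset (quartiles worked out by hand using the median-split steps):

```python
# five-number summary plus the 1.5 x IQR rule for potential outliers
data = sorted([1, 3, 3, 4, 5, 6, 6, 7, 8, 30])

q1, med, q3 = 3, 5.5, 7            # quartiles of this ordered data (by hand)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
inside = [x for x in data if lower_fence <= x <= upper_fence]

five_number = (data[0], q1, med, q3, data[-1])  # min, Q1, median, Q3, max
whiskers = (min(inside), max(inside))           # whisker ends in a boxplot

print(five_number, outliers, whiskers)  # (1, 3, 5.5, 7, 30) [30] (1, 8)
```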