Modules 1-4
learning to use statistics is NOT about…
calculations
mechanical conclusions
certainty and exactitude
statistics should be concerned with…
creating arguments that convey interesting and credible points based on interpretation of appropriate evidence from empirical measurements or observations
analysis of data requires observing some _____ _______
comparative difference
example of comparative difference
are the scores for a sample of people on some outcome measure different for two groups of people assigned to different conditions of an experiment
inferential statistics provides guidance in doing what
differentiating among different types of explanations (chance vs systematic) for a given comparative difference
what is considered to be a claim
a chosen explanation
what are the competing explanations for CDs (comparative differences)
chance vs systematic explanations
3 explanations for difference
systematic factor accounts for ALL variation
random chance accounts for ALL variation
combination of systematic and chance account for variation
in psychology, which explanation for difference is rarely/never used
the first explanation (systematic factor accounts for ALL variation)
almost never tenable and often not directly tested
what explanations do we test? what happens if that explanation is rejected?
we first test the chance explanation
if that explanation doesn't hold and our test rejects it, then we accept the combination explanation
what is NHST meant to assess
if our observed difference is substantially different from what one would expect if random chance factors were completely responsible (the null hypothesis)
NHST is the ____ _____ for differentiating between chance and systematic influence explanations
dominant procedure
(NHST) if the data are not dramatically discrepant with what would be expected from chance, what can we conclude?
the random chance explanation is a tenable explanation for our difference
(NHST) if the test indicates the data are highly inconsistent with a chance explanation, what can we conclude?
we discard the purely chance explanation and by process of elimination prefer the explanation of a combined systematic influence and chance factors
why don't we use terms like "accept" and "reject" in formal statistical language
these terms are generally considered too strong and definitive because NHST results are aids to judging between explanations, not absolute declarations of truth or falsity
alternatives to “accept” and “reject”
if we fail to reject - “null hypothesis remains viable” or “we have retained the null hypothesis”
if we reject - we have “discredited” the null hypothesis
limitations of NHST
backhanded way of testing a research hypothesis
concluding that there is some systematic difference is a very limited form of information
what is the logic behind Abelson’s MAGIC criteria
Abelson believes that the goal of statistical analysis should be to make compelling and persuasive claims; the MAGIC criteria helps us to establish if a claim is compelling and persuasive
MAGIC
magnitude
articulation
generality
interestingness
credibility
M (MAGIC)
magnitude; how big is the effect? large effects are more compelling than small ones
A (MAGIC)
articulation; how specific is the claim? precise statements are more compelling than imprecise ones.
G (MAGIC)
generality; how generally does it apply? more general effects are more compelling than less general ones. claims that would interest a more general audience are more compelling
I (MAGIC)
interestingness; interesting effects are those that "have the potential, through empirical analysis, to change what people believe about an important issue" more interesting effects are more compelling than less interesting ones. In addition, more surprising effects are more compelling than ones that merely confirm what is already known
C (MAGIC)
credibility; credible claims are more compelling than incredible ones. the researcher must show that the claims made are credible. results that contradict previously established ones are less credible
chance is the ______ ______ we assume, UNLESS….
baseline explanation; the data require us to adopt a more complex explanation (chance + systematic)
NHST is used to tell us how…
rare the observed discrepancy is between the sample and the true population value (i.e., how unlikely the difference would be under chance alone)
sampling error
error in statistical analysis arising from the unrepresentativeness of the sample taken
if NHST indicates the observed difference is common for a sample with no difference….
chance explanation is viable
if NHST indicates the observed difference is very rare for a sample with no difference….
chance explanation is NOT viable
what influences NHST?
sample size
in NHST, differences in larger samples are…
smaller because they tend to more accurately represent the population
unlikely; NHST is going to assess a difference as unlikely even if the difference is only small because sampling error is low in this situation
in NHST, differences in smaller samples are…
larger because they tend to less accurately represent the population
UNlikely; NHST is going to assess a difference as unlikely only if the difference is substantial, because even sizable differences often occur by chance when sampling error is high
small sample =
bigger sampling error
large sample =
smaller sampling error
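A quick illustration of this in Python (a sketch assuming numpy; the population mean and SD are made-up values):

import numpy as np

rng = np.random.default_rng(0)
for n in (10, 100, 1000):
    # draw 5,000 samples of size n and look at the spread of their means
    means = rng.normal(loc=100, scale=15, size=(5000, n)).mean(axis=1)
    print(f"n = {n:>4}: SD of sample means (empirical standard error) = {means.std():.2f}")
# the spread shrinks roughly as 15 / sqrt(n): bigger samples, smaller sampling error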
what is an independent samples t-test
a statistical test that compares two samples that are independent from one another (they are drawn from separate populations)
generic formula for t-tests
t = (sample statistic − hypothesized population value) / standard error of the statistic
what do larger t values indicate
greater likelihood of discrepancy from the hypothesized (population) value
less likely that the 2 samples were drawn from 2 populations that DO NOT differ in mean scores
more difference between pop.
more weird/unlikely
what is standard error
precision of sample estimates/amount of error in the sample
larger SE (t-test) =
smaller t value
what do smaller t values indicate
less difference between sample and population
less weird and unlikely
standard deviation reflects the…
dispersion of scores around the sample mean
formula for independent samples t-test
t = [(𝑋1 − 𝑋2) − (𝜇1 − 𝜇2)] / 𝑆(𝑋1 − 𝑋2)
where 𝑋1 and 𝑋2 are means from the two samples
where 𝜇1 and 𝜇2 are means from the two populations (their difference is zero under the null hypothesis)
where 𝑆(𝑋1 − 𝑋2) is the standard error of the difference between the sample means
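A minimal sketch of running this test in Python (assuming scipy; the two groups' scores are made-up illustrative data):

from scipy import stats

group1 = [5, 7, 6, 9, 8, 7]
group2 = [4, 5, 6, 5, 4, 6]
t, p = stats.ttest_ind(group1, group2)  # assumes equal variances by default
print(f"t = {t:.2f}, two-tailed p = {p:.3f}")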
characteristics of the t-value distribution
helps us interpret a t value with a given df
centers on a mean of zero, with the majority of values falling relatively close to zero and fewer and fewer values the further one moves from zero
when there are few df, the values tend to spread out from zero; they cluster closer to zero as df increases, and the distribution eventually approximates the normal distribution (very close above 120 df)
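The convergence toward the normal distribution shows up in the two-tailed .05 critical values, sketched here in Python (assuming scipy):

from scipy import stats

for df in (5, 30, 120):
    # critical t for a two-tailed .05 test at this df
    print(f"df = {df:>3}: critical t = {stats.t.ppf(0.975, df):.3f}")
print(f"normal:   critical z = {stats.norm.ppf(0.975):.3f}")  # t approaches this as df grows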
the error in our sample mean as an estimate of the population mean (standard error) decreases as…
dispersion decreases and sample size increases
what is the level where we decide the chance explanation is no longer tenable? (t-tests)
alpha
conventional standard for alpha (t-tests)
5% (p ≤ .05)
two-tailed tests
test where there is no direction
no special status to direction of effect so we consider the 2.5% most extreme negative values and 2.5% most extreme positive values
type I error
concluding a mean difference exists in the populations (rejecting the null) when there is actually no difference (the null is true)
alpha is the chance of ____ ___ _____ we are willing to accept (t-tests)
type I error
what is a one-tailed test
when a researcher has a strong basis to make a directional prediction regarding the mean difference, differences in the “wrong” direction will be dismissed and treated as similar to null effects
one will only consider extreme t values in one direction (e.g., positive)
rather than splitting the 5% into both tails of the distribution, all 5% is in one tail
one-tailed tests make the test more….
“liberal” in that less extreme values are required for significance
a significant one-tailed t value (p = .05) will correspond with….
a two-tailed t value with a p = .10 (one-tailed p × 2)
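The one-tailed/two-tailed correspondence can be checked directly, as in this Python sketch (assuming scipy; the t value and df are made-up numbers near the one-tailed cutoff):

from scipy import stats

t_val, df = 1.70, 30
p_one = stats.t.sf(t_val, df)  # one-tailed p (upper tail)
p_two = 2 * p_one              # two-tailed p is twice the one-tailed p
print(f"one-tailed p = {p_one:.3f}, two-tailed p = {p_two:.3f}")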
criticisms of one-tailed directional tests
how sure do we need to be to use it?
too liberal?
can we/should we ignore differences in the opposite direction?
what is a lopsided test
not common but used to compromise between one and two-tailed tests when a researcher has a directional prediction
differentially weight the tails of the distribution (i.e., more liberal threshold for the predicted direction and a more conservative threshold for the unexpected direction)
makes it easier to abandon the null if it is in the expected direction, but does allow for abandoning the null for an unexpected finding if it meets a very stringent standard
no widely accepted standard - researcher could specify any differential weighting so long as the researcher could defend the logic of the choice
type II error
concluding there is no mean difference between our populations (accepting or failing to reject the null) when there is actually a difference in means between the populations (the null is false)
conventional level for type II error
.20
what is power
probability that a statistical test will correctly reject a false null hypothesis
relationship between power and type II error (t-tests)
power is inversely related to Type II error
as the statistical power of a test increases, the likelihood of making a Type II error decreases (and vice versa)
determinants of power (t-tests)
alpha level; stricter the alpha, the lower the power (controlled by researcher)
sample size; larger the sample size, the greater the power (controlled by researcher)
magnitude of effect - effect size; larger the effect of the IV, the greater the power (somewhat under control of researcher)
how can we plan sample size based on power (t-tests)
can use power as a basis for determining an appropriate sample size
we specify our alpha (e.g., .05), specify our desired power (e.g., .80), and then make an assumption about the magnitude of the effect we expect, we can calculate the sample size required to achieve that power
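A sketch of that calculation in Python (assuming the statsmodels library; the alpha, power, and effect size values are the card's illustrative examples):

from statsmodels.stats.power import TTestIndPower

# solve for the per-group n needed to detect an assumed effect of d = 0.5
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(round(n_per_group))  # roughly 64 participants per group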
traditional view of problems with low power
difficult to interpret null findings
wasteful to conduct research with low power if you get a null
perhaps not a big deal if we get a significant effect?
contemporary view of problems with low power
the problem is more complex: low power is not only an issue of false negatives; it can also produce false positives
power is low because we typically have small effect and/or small sample size
studies of this sort will tend to have lots of error in estimating populations
not so misleading if we run lots of studies and report them all
is a problem IF…
we do a single study and then only report it if it's significant
do lots of studies and only report the significant ones
power is a major concern in the replication crisis
assumptions of independent samples t-tests
independence of observations
the distribution of the outcome variable should be normally distributed in each group
homogeneity (equality) of variance in the outcome variable across the groups
what is a repeated measures t-test
testing a difference between two means for the same sample of people
contexts for using repeated measures t-tests
testing Time 1 and Time 2 differences on the same outcome measure
testing differences on the same outcome measure under different conditions (e.g., within-subjects experiment)
testing differences in means for two different outcome measures (requires equivalence of scaling)
formula for repeated measures t-test
t = (𝐷 − 𝜇𝐷) / 𝑆𝐷
where 𝐷 is the sample mean of the difference scores
where 𝜇𝐷 is the mean of the difference scores in the population (zero under the null hypothesis)
where 𝑆𝐷 is the standard error for the sample mean of difference scores
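A minimal sketch of this test in Python (assuming scipy; the Time 1/Time 2 scores are made-up data for the same five people):

from scipy import stats

time1 = [10, 12, 9, 14, 11]
time2 = [13, 14, 10, 16, 12]
t, p = stats.ttest_rel(time1, time2)  # paired (repeated measures) t-test
print(f"t = {t:.2f}, two-tailed p = {p:.3f}")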
for repeated measures t-tests, factors affecting the size of the t value are….
similar: magnitude of difference, standard deviation of difference scores, and sample size
concepts such as alpha (α), one-tailed vs. two-tailed tests, beta (β), and power all remain the same
factors affecting the size of the t value (t-tests)
magnitude of difference
t-value tends to increase as the magnitude of the difference between the groups or conditions increases.
if the difference between groups is very small, the t-value is likely to be small
standard deviations of difference scores
when the SD is small (less variability), it leads to a larger t-value because the difference between groups is relatively more pronounced compared to the variation within each group.
a larger SD (greater variability) results in a smaller t-value because the difference between groups is less clear when compared to the inherent variability within each group.
sample size
larger sample size provides more data, which can reduce the impact of random variation and make it easier to detect significant differences
smaller sample sizes can lead to larger variability in the t-value, making it harder to detect significant differences unless the effect is very large
assumptions of repeated measures t-tests
independence of observations
difference scores are normally distributed
advantages/disadvantages of independent samples vs repeated measures designs
RM have more power
RM are more economical
IND. have no carry-over effects
IND. less vulnerable to demand characteristics
carry-over effects
when the effects of one treatment or condition persist and influence the outcomes of subsequent treatments or conditions
demand characteristics
subtle cues or expectations within an experiment that may influence participants' behavior or responses
NHST does not speak to the….
size of the difference; and we need to know the magnitude of the difference to make compelling statistical claims based on the MAGIC criteria !!
concluding that the null hypothesis is very unlikely (based on the p value) is not the same as concluding that…
the difference is large!
the proposed alternative to NHST
Bayesian statistics
advocates for Bayesian statistics argue that the logic of NHST is…
fundamentally flawed
just because the null is unlikely for our data, does not necessarily mean the data are likely to be drawn from a population where our systematic difference is true (i.e., the alternative hypothesis is true)
what is the Bayes factor
what we calculate in Bayesian stats
ratio of the likelihood of the data under the alternative hypothesis relative to the likelihood of the data under the null hypothesis
used to calculate magnitude of difference
interpretation of the Bayes factor
value of 1 means equal likelihood of alternative relative to null
below 1 means null more likely
above 1 means alternative more likely than null (threshold of 3 for moderate evidence, 10 for strong evidence)
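A sketch of getting a Bayes factor in Python (assuming the pingouin library, whose bayesfactor_ttest converts an observed t value and group sizes into a default-prior Bayes factor; the t and n values here are made up):

import pingouin as pg

bf10 = pg.bayesfactor_ttest(t=2.5, nx=30, ny=30)  # BF10: alternative relative to null
print(f"BF10 = {bf10:.2f}")  # above 3 ~ moderate evidence, above 10 ~ strong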
objection to Bayesian statistics
does increase in confidence in the alternative relative to the null really translate into magnitude of the effect and how do we interpret that?
raw effect size
measure of the absolute or unadjusted difference between groups or conditions, typically expressed in the original units of measurement of the data, making it interpretable in a practical context
e.g. mean differences or unstandardized regression coefficients (in regression)
when are raw effect sizes useful
when the DV is on a metric that is meaningful and readily interpretable in light of some other criteria
when you want to convey the size of the effect in the same units as the data and when you need to understand the practical significance of an effect
when are raw effect sizes problematic
when the outcome variable is not easily interpretable with respect to specifiable criteria
when one needs to compare effects with outcome variables that are on different metrics
NOTE: these don't get used that much in psych!
what is a standardized effect size
designed to provide a unitless or standardized representation of the effect
common standardized effect size measures include Cohen's d (for comparing means), Pearson's r (for correlational relationships) and Hedges' g (a variant of Cohen's d that corrects for sample size bias)
what is Cohen’s d
one of the most widely used effect size indices; expresses magnitude as a standardized difference between means
Cohen’s d for independent samples
the mean difference divided by the pooled standard deviation of the two samples
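A sketch of this calculation in Python (assuming numpy; the two groups' scores are made-up data):

import numpy as np

g1 = np.array([5, 7, 6, 9, 8, 7])
g2 = np.array([4, 5, 6, 5, 4, 6])
n1, n2 = len(g1), len(g2)
# pooled standard deviation of the two samples
pooled_sd = np.sqrt(((n1 - 1) * g1.var(ddof=1) + (n2 - 1) * g2.var(ddof=1)) / (n1 + n2 - 2))
d = (g1.mean() - g2.mean()) / pooled_sd
print(f"d = {d:.2f}")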
what impacts Cohen’s d (Ds) for independent samples
increases as the mean difference increases and the standard deviations decrease
not influenced by sample size
interpretation/range of Cohen’s d (Ds) for independent samples
has a minimum value of 0 (no difference) and no upper boundary
can be interpreted as a proportion (or multiple) of the dependent variable's standard deviation
0.5 indicates the difference between the means is half the size of the dependent variable's standard deviation
1.00 indicates the difference is as big as the standard deviation of the dependent variable
2.00 indicates a mean difference twice the size of the standard deviation of the dependent variable
Cohen’s d (Ds) guidelines for independent samples
0.2 (small), 0.5 (medium), and 0.8 (large)
not based on a compelling theoretical or empirical foundation, chosen arbitrarily
Cohen's d for independent samples
dS
Cohen’s d for repeated measures
dAV or dRM
interpretation of Cohens d for repeated measures
𝑑𝑎𝑣 and 𝑑𝑟𝑚 are interpreted in a manner similar to 𝑑𝑠
factors affecting the size of 𝑑𝑎𝑣 and 𝑑𝑟𝑚 are similar to those for 𝑑𝑠
when the standard deviations in both sets of observations are equal, 𝑑𝑎𝑣 and 𝑑𝑟𝑚 are equal
𝑑𝑎𝑣 will tend to be more similar to 𝑑𝑠 than 𝑑𝑟𝑚 is, except when r is low and the difference between the standard deviations is large
𝑑𝑟𝑚 is more conservative than 𝑑𝑎𝑣 but is considered overly conservative when r is large
what is Hedges g
a modification of Cohen's d (another effect size measure) designed to account for potential bias in the estimation of effect sizes due to small sample sizes
Cohen’s d is a positively biased estimate of the population effect size, particularly for small samples; Hedges' g corrects d for that bias
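The correction itself is small; a Python sketch using the commonly used approximation g = d × (1 − 3 / (4·df − 1)), with df = n1 + n2 − 2 (the d and group sizes are made up):

def hedges_g(d: float, n1: int, n2: int) -> float:
    # small-sample bias correction applied to Cohen's d
    df = n1 + n2 - 2
    return d * (1 - 3 / (4 * df - 1))

print(round(hedges_g(0.80, 10, 10), 3))  # shrinks d slightly, more so for small n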
what is Pearson’s r coefficient
used to quantify the strength and direction of the linear relationship between two continuous variables (another effect size), it is one of the most common methods for assessing the degree to which two variables are related to each other
what is a point biserial correlation
the relationship between a dichotomous variable (e.g., membership in one of two groups) and a continuous variable (e.g., a dependent variable)
can be expressed through Pearsons r
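A sketch in Python (assuming scipy; the group codes and scores are made up). pointbiserialr gives the same value as Pearson's r computed on a 0/1 group code:

from scipy import stats

group = [0, 0, 0, 1, 1, 1]    # dichotomous group membership
scores = [4, 5, 6, 7, 8, 7]   # continuous outcome
r, p = stats.pointbiserialr(group, scores)
print(f"r = {r:.2f}, p = {p:.3f}")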
interpreting r
ranges from -1.00 to 1.00 with .00 indicating no association
standardized effect sizes and their relationship to importance
small effects are not necessarily unimportant and large effects are not necessarily important
why do large effect sizes not directly imply practical significance
metric can be hard to interpret without reference to more concrete reference criteria
durability of an effect might also be relevant in addition to its size
cost/benefit analysis also can determine practicality
when are small effects impressive
when there are minimal manipulations of the independent variable
difficult to influence dependent variable
conceptual consequences of an effect
existence of an effect differentiates between competing theories
existence of an effect challenges reigning theory
existence of an effect demonstrates a new or disputed phenomenon
what can be used to calculate confidence intervals for the sample value
standard error
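A sketch of building a 95% confidence interval for a sample mean from the standard error (assuming numpy and scipy; the data are made up):

import numpy as np
from scipy import stats

x = np.array([5, 7, 6, 9, 8, 7])
se = x.std(ddof=1) / np.sqrt(len(x))        # standard error of the mean
t_crit = stats.t.ppf(0.975, df=len(x) - 1)  # two-tailed 95% critical value
lo, hi = x.mean() - t_crit * se, x.mean() + t_crit * se
print(f"95% CI: [{lo:.2f}, {hi:.2f}]")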