Chapter 8: Introduction to Hypothesis Testing

  • hypothesis testing: a statistical method that uses sample data to evaluate a hypothesis about a population. inferential procedure. super common. details change depending on situation, but general process remains constant.

    • the basics:

      • state a hypothesis about the population, usually concerning the value of a population parameter.

      • use the hypothesis to predict the characteristics the sample should have. eg, if you estimate population’s mean to be 7, sample’s should also be around 7 (not necessarily exactly 7, but close to 7).

      • obtain a random sample from the population.

      • compare the obtained sample data with the prediction made from the hypothesis. if consistent with prediction, we conclude hypothesis is reasonable; if big discrepancy, we decide hypothesis is wrong.

  • one situation involves starting with a known population (a set of individuals as they exist before treatment) and then comparing their data to determine what happens to the population after a treatment is administered.

    • we make the basic assumption that if the treatment has any effect, it is simply to add/subtract a constant amount from each individual’s score, so the mean [but not the shape and standard deviation!] should change.

    • the research is focused on the unknown population after treatment, not the original population.

    • also note that we assume we know the parameter for the population before treatment.

    • since we usually can’t administer a treatment to an entire population, we take a representative sample and draw conclusions about the treatment for the entire population based on the treatment for that sample. note that the unknown population is always hypothetical—the treatment is never administered to the entire population, and we are instead asking what would happen if we did.

    • although the treated sample is obtained indirectly (ie, by taking sample of untreated population), it is still equivalent to a sample obtained directly from the unknown treated population (which, as we said, is always hypothetical).

  • hypothesis testing is a formalised procedure that follows a standard series of operations; thus researchers have a standardised method of analyzing and evaluating the results of their studies. this in turn means that other researchers can read one another’s work and understand right away how that data was evaluated and how the conclusions were reached.

  • basic 4-step process of hypothesis testing (which we will use later and adjust/elaborate):

    • step 1: state the hypothesis.

      • actually, we state 2 opposing hypotheses in terms of population parameters.

        • the first hypothesis is called the null hypothesis (H0) and states that in the general population there is no change, no difference, or no relationship. in the context of an experiment, it predicts that the independent variable (treatment) has no effect on the dependent variable (scores) for the population. this is the more important hypothesis. the subscript zero denotes that this is the zero-effect hypothesis. symbolically, we write H0: μtreatment = μ, ie the mean of the treated population is equal to the control’s mean/entire untreated population’s mean.

        • the second is called the alternative hypothesis (H1), which states that there is a change, a difference, or a relationship for the general population. in the context of an experiment, H1 predicts that the independent variable (treatment) does have an effect on the dependent variable. also known as the scientific hypothesis. written symbolically as H1: μtreatment ≠ μ, ie the mean of the treated population is not equal to the control’s mean/entire untreated population’s mean.

          • note that this doesn’t say what the change will be, only that there will be a change. in some situations, it’s appropriate to specify the direction of the effect, eg with a < or > instead of a ≠.

    • step 2: set the criteria for a decision.

      • eventually, we use data to support or refute the null hypothesis. to formalize this process, we use the null hypothesis to predict the kind of sample mean we ought to obtain—which sample means are consistent with the null hypothesis and which are at odds with it? to determine which values fall where on that spectrum, we examine all the possible sample means that could be obtained if the null hypothesis is true with a distribution of sample means of the given sample size. as before, the extreme tails are the noteworthy areas (ie, those at odds with the null hypothesis).

      • to find the boundaries that separate the high-probability samples from the low-probability samples, we select a specific value known as the level of significance or alpha level (α) for the hypothesis, which is a small probability used to ID the low-probability samples. by convention, we commonly use α=.05 (5%), α=.01 (1%), and α=.001 (0.1%). that means we separate the most extreme 5, 1, or 0.1% of sample means from the most likely 95, 99, or 99.9%. the extremely unlikely values, as defined by the alpha level, make up the critical region. these extreme values in the tails define outcomes inconsistent with the null hypothesis. if data from a research study produces a sample mean located in the critical region, we reject the null hypothesis.

        • alpha level/level of significance: a probability value used to define the concept of “very unlikely” in a hypothesis test.

        • critical region: composed of the extreme sample values that are very unlikely (as defined by the alpha level) to be obtained if the null hypothesis is true. the boundaries for the critical region are determined by the alpha level. if sample data fall in the critical region, the null hypothesis is rejected. we can also define this as sample values that provide convincing evidence that the treatment really does have an effect.

      • to determine the exact location for the boundaries that define the critical region, we use the alpha-level probability and the unit normal table—in most cases, the distribution of sample means is normal, so the unit normal table provides the precise z-score locations for the extreme 5%, 1%, or 0.1% (matching your chosen α). eg, we know from previous chapters that the z-scores for the extreme 5% are ±1.96. (for 1%, it’s ±2.58 or ±2.57 (both are equally good), and for 0.1%, it’s ±3.30.)
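
      • a minimal python sketch of that boundary lookup, using the stdlib’s normal distribution in place of the unit normal table (the function name is mine, not the book’s):

```python
# sketch: two-tailed critical boundaries for a chosen alpha level,
# using python's stdlib normal curve instead of the unit normal table
from statistics import NormalDist

def critical_z_two_tailed(alpha):
    # z-score that cuts off alpha/2 in each tail of the standard normal
    return NormalDist().inv_cdf(1 - alpha / 2)

for alpha in (0.05, 0.01, 0.001):
    z = critical_z_two_tailed(alpha)
    print(f"alpha={alpha}: critical region is z <= {-z:.2f} or z >= {z:.2f}")
```

      (the exact value for α=.001 comes out as ±3.29; the table value is commonly rounded to ±3.30.)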

    • step 3: collect data and compute sample statistics.

      • only at this point do we run the study and collect the data. this ensures that the researcher makes an honest and objective evaluation of the data and doesn’t tamper with the decision criteria after the experimental outcome is already known.

      • then we summarise the raw data from the sample using the appropriate statistics, often the sample mean. then we compare the sample mean (data) with the null hypothesis. (this is truly the heart of hypothesis testing.) this comparison is usually done by finding the z-score for the sample mean relative to the hypothesized population mean from H0. we calculate this z-score with z = (M − μ)/σM, or z = (M − μ)/(σ/√n).

        • “Notice that the top of the z-score formula measures how much difference there is between the data and the hypothesis. The bottom of the formula measures the standard distance that ought to exist between a sample mean and the population mean.”

        • recall that we calculate standard error with σM = σ/√n, where n=sample size.
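
      • the step-3 computation as a tiny python sketch (the numbers are made up for illustration):

```python
# sketch: z-score for a sample mean relative to the H0 population mean
# (M, mu, sigma, n below are hypothetical illustration values)
import math

def standard_error(sigma, n):
    # sigma_M = sigma / sqrt(n)
    return sigma / math.sqrt(n)

def z_for_sample_mean(M, mu_H0, sigma, n):
    return (M - mu_H0) / standard_error(sigma, n)

# eg: H0 says mu = 80, sigma = 10, and a sample of n = 25 gives M = 84
print(z_for_sample_mean(84, 80, 10, 25))  # (84-80)/(10/5) = 2.0
```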

    • step 4: make a decision.

      • using the z-score you obtained in step 3, decide whether to reject or fail to reject your null hypothesis based on the criteria you established in step 2.

        • if in critical region, we reject the null hypothesis, concluding the treatment does, in fact, have an effect.

        • if not in critical region, we fail to reject the null hypothesis, meaning the treatment does not appear to have an effect.
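
    • the whole 4-step procedure can be sketched as one small python function (two-tailed test; all input values are hypothetical):

```python
# sketch: the 4-step hypothesis test as code (two-tailed z-test)
import math
from statistics import NormalDist

def z_test(M, mu_H0, sigma, n, alpha=0.05):
    # step 1 is the hypothesis itself: H0 says the population mean is mu_H0
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # step 2: set the criteria
    z = (M - mu_H0) / (sigma / math.sqrt(n))       # step 3: compute the statistic
    reject_H0 = abs(z) >= z_crit                   # step 4: make the decision
    return z, reject_H0

z, reject = z_test(M=84, mu_H0=80, sigma=10, n=25)
print(f"z = {z:.2f}, reject H0: {reject}")  # z = 2.00, reject H0: True
```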

  • it can be helpful to think of rejecting the null hypothesis and failing to reject the null hypothesis the way we think of jury trials, where the verdict is either guilty or not guilty (not guilty or innocent). just as we don’t conclude that the defendant is innocent if they aren’t adequately proven guilty, we don’t assume that the treatment doesn’t work just because we fail to reject the null hypothesis—it just means we don’t have enough evidence yet that it does work.

  • the z-score stat used in the hypothesis test is our first specific example of a test statistic, a term that indicates that the sample data are converted into a single, specific statistic that is used to test the hypotheses. we’ll use several others later on, but most have the same basic structure and purpose as the z-score.

  • in a hypothesis test with z-scores, we essentially have a baking recipe where one ingredient’s value is missing. we have the formula for z-scores but don’t know the value for the population mean, μ, so we:

    • make a hypothesis about the value of μ (this is the null hypothesis).

    • plug the hypothesized value into the formula along with the other values.

      • note that in this equation, the numerator measures the obtained difference between the sample mean and hypothesized population mean. the standard deviation in the denominator measures the natural amount of standard distance between the sample mean and population mean without any treatment. thus the z-score forms a ratio of actual difference between sample M and hypothesis μ ~versus~ standard difference between M and μ without treatment. thus a z-score of 3, eg, means the difference between the sample and hypothesis is 3 times higher than we would expect if the treatment had no effect.

      • in general, a large value for a test statistic like z-score indicates a large discrepancy between sample data and null hypothesis, ie that the sample data are very unlikely to have occurred purely by chance. this means that when we obtain a large value (which is in the critical region), we can conclude the treatment must have had an effect.

    • if the formula produces a z-score near 0 (the expected location when the hypothesis is true), we conclude the hypothesis was correct; if it produces an extreme value, we conclude it was wrong.

  • there is always a possibility that an incorrect conclusion will be made in the hypothesis-testing situation!! there are 2 types of errors in a hypothesis test:

    • Type I errors (alpha): occur when a researcher rejects a null hypothesis that is actually true (ie, says the treatment has an effect when it doesn’t). in a typical research situation, a Type I error means the researcher concludes that a treatment does have an effect when in fact it has no effect. sometimes you just so happen to draw an extreme sample that makes it look like the treatment has a huge effect when really the sample itself is just unusual. this is always possible, although the chances of this error are lower the larger the sample is. note that this isn’t the researcher’s fault—their conclusion is logical given their data, it’s just that their sample is misleading.

      • in most situations, these kinds of errors are quite serious: rejecting the null hypothesis means the researcher is likely to report and even publish the false finding, which can then lead other researchers to formulate hypotheses and even entire studies based on false information.

      • the hypothesis test minimises the risk of Type I errors—remember that the alpha level determines what % of samples have means in the critical region, so your α determines your chances of obtaining a sample mean in the critical region when the null hypothesis is true. (eg: α=.05 means that there’s only a 5% chance of pulling a sample in the critical region when the treatment has no effect.) so…

        • the alpha level for a hypothesis test is the probability that the test will lead to a Type I error. that is, the alpha level determines the probability of obtaining sample data in the critical region even though the null hypothesis is true.

      • note: when the data is in the critical region, the appropriate conclusion is still to reject the null hypothesis! most of the time, this is correct. the risk of a Type I error is low and under the researcher’s control (based on what α they choose).

    • Type II errors (beta): occur when a researcher fails to reject a null hypothesis that is in fact false (ie, says treatment has no effect when it does). in a typical research situation, a Type II error means that the hypothesis test has failed to detect a real treatment effect. this happens when the sample mean is not in the critical region even though the treatment does have an effect on the sample. this is most likely to happen when treatment effects are relatively small—treatment does still influence the sample, but not to the magnitude necessary for our study to pick up on it. in these cases, researchers conclude that there is not enough evidence to say that there is a treatment effect.

      • results aren’t usually as serious as Type I errors—these just mean the research data doesn’t show the treatment effect the researchers hoped for. the researcher can conclude that there’s no effect/that the effect is so small that it’s not worth further pursuing, or the researcher can repeat the study with a new sample (and generally a larger sample) to try to demonstrate that the treatment really does work.

      • unable to determine a single, exact probability for a Type II error. instead, it depends on a number of factors and is therefore a function rather than a specific number. still, we represent it with the Greek letter beta, β.

  • as we know, the researchers select the alpha level, and the alpha level is very important! so how do we choose an alpha level?

    • primary concern is to avoid Type I error, so we tend to have very small probability values for our alpha levels (thus .05, .01, and .001 being the common ones). by convention, the largest permissible value is α=.05. when there is no treatment effect, this means you only have a 5% risk (or a 1 in 20 probability) of rejecting the null hypothesis and committing a Type I error. since 5% is still relatively high, though, many researchers and scientific publications prefer more conservative levels like .01 or .001.

    • however!! it might seem like choosing a tiny α is the best choice, but that means the hypothesis test needs much more evidence from the research results.

    • these 2 issues are balanced by controlling the boundaries of the critical region. the data must be in the critical region for us to determine treatment has an effect, and if the treatment has an effect, it should push the data into the critical region. however, as α is lowered, the critical region moves farther and farther out and is harder and harder to reach. this also means that as α decreases, the distance between the sample mean and the value of μ [as stated in the null hypothesis] increases. extremely small alpha regions are almost impossible to find even in successful treatments (though that also means virtually no chance of a Type I error). so, the alpha values that researchers consider reasonably good are .05, .01, and .001, as they provide a low risk of error without placing excessive demands on the research results.

  • in literature, we don’t explicitly state that we used a z-score as a test stat with an alpha level of .05, nor will we be told “the null hypothesis is rejected.” instead, we write it as, “IV had a significant effect on DV, z=#.##, p<.05.”

    • we consider a result [statistically] significant if it is very unlikely to occur when the null hypothesis is true. that is, the result is sufficient to reject the null hypothesis. thus, a treatment has a significant effect if the decision from the hypothesis test is to reject H0.

    • when we put p<.0#, we’re saying that there is a possibility of a Type I error, and that the probability of such is #%. this is the alpha value.

    • if we fail to reject H0, we state something like, “IV did not have a significant effect on DV, z=#.##, p>.05.”

    • when a hypothesis test is conducted using a computer programme, the software will often give you an exact p value, which researchers are encouraged to report as-is, ie p=.0#### instead of using < or > notation. regardless, the p value still needs to be <.05 in order to be considered significant.
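
    • a sketch of how software turns a z statistic into an exact two-tailed p value (function name mine):

```python
# sketch: exact two-tailed p value for a z statistic,
# as statistical software would report it
from statistics import NormalDist

def p_two_tailed(z):
    # probability of a result at least this extreme (either tail) if H0 is true
    return 2 * (1 - NormalDist().cdf(abs(z)))

print(round(p_two_tailed(1.96), 3))  # 0.05 -> right at the alpha=.05 boundary
print(round(p_two_tailed(2.50), 4))  # well under .05, so significant
```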

  • the final decision in a hypothesis test is determined by the z-score statistic. this value is determined by…

    • the difference between the sample mean and the hypothesized population mean from H0. a big difference indicates noticeable difference, supporting a conclusion that the treatment effect is significant.

    • the standard error, which we recall is defined as standard deviation divided by the square root of the sample size.

    • the variability of the scores! recall that high variability makes it difficult to find clear patterns in research study results. in a hypothesis test, the higher the variability, the lower the chances of finding a significant treatment effect [if other values held constant]. generally, larger variability produces larger standard error and smaller value (closer to 0) for the z-score.

    • the size of the sample! the larger the sample, the more likely we are to find an adequately high z-score if our treatment works, ie the more likely to result in a significant treatment effect. generally, decreasing sample size [when other values are held constant] produces larger standard error and smaller z-scores (closer to 0).
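
    • both influences are easy to see numerically (hypothetical numbers, with the 4-point mean difference held constant):

```python
# sketch: how variability and sample size change the z statistic
# while the 4-point mean difference stays the same
import math

def z_stat(M, mu, sigma, n):
    return (M - mu) / (sigma / math.sqrt(n))

print(z_stat(84, 80, 10, 25))   # 2.0: baseline
print(z_stat(84, 80, 20, 25))   # 1.0: doubling variability halves z
print(z_stat(84, 80, 10, 100))  # 4.0: quadrupling the sample doubles z
```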

  • in practice, researchers don’t concern themselves much with the basic assumptions underlying hypothesis testing because the tests usually work well even when the assumptions are violated; however, we as starters need to know and understand the assumptions that must be satisfied lest the hypothesis test be compromised:

    • random sampling: it is assumed that participants in the study were selected randomly (which means the results of the study are generalizable to the general population).

    • independent observations: the values in the sample must consist of independent observations! in other words, 2 observations are independent if there is no consistent, predictable relationship between the first observation and the second. more precisely, 2 events are independent if the occurrence of #1 has no effect on the probability of #2. this is usually satisfied by using a random sample, esp wherein individuals aren’t related (whether familially or as friends).

      • eg: coin toss. the chance of obtaining heads or tails is not affected by the previous coin tosses (the chance is always 50% for each), although the “gambler’s fallacy” has many of us believing that, eg, the chances of obtaining a heads after 4 tails in a row are increased. they’re not. the coin doesn’t have a memory and doesn’t gaf about past outcomes. the chance is always 50%.

      • note that this means sampling with replacement is crucial!! :O
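
      • a quick toy simulation of the gambler’s-fallacy point (entirely hypothetical numbers):

```python
# sketch: after 4 tails in a row, the next toss is still ~50% heads
import random

random.seed(0)  # fixed seed so the run is reproducible
tosses = [random.random() < 0.5 for _ in range(200_000)]  # True = heads

next_after_4_tails = [
    tosses[i] for i in range(4, len(tosses))
    if not any(tosses[i - 4:i])  # previous four tosses were all tails
]
print(round(sum(next_after_4_tails) / len(next_after_4_tails), 2))  # ~0.5
```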

    • the value of σ is unchanged by the treatment: we formulate the standard error, σM, based on σ of the population, which means we assume the standard deviation doesn’t change with treatment!!

      • (this is actually due to a more general assumption that we make in many statistical procedures: the effect of the treatment is to add or subtract a constant amount from every score in the population, meaning the mean changes but not the standard deviation or shape. note that this is a theoretical ideal—IRL, treatments rarely show a perfect and consistent additive effect.)

    • normal sampling distribution: we have used the unit normal table to evaluate hypotheses with z-scores, meaning we have assumed the distribution of scores is normal.

  • so far, we’ve been discussing the two-tailed/standard [hypothesis] test format (named such because the critical zone is divided between the two tails of the curve). there is an alternative…

  • in a directional hypothesis test/one-tailed test, the statistical hypotheses H0 and H1 specify either an increase or a decrease in the population mean—that is, they make a statement about the direction of the effect.

    • first, most critical step is to state the statistical hypothesis. the null hypothesis will still be that the treatment does not work; the alternative hypothesis is now that the treatment works as predicted (ie, DV increases/decreases as specified). it’s generally easiest to do this with inequalities, such as H1: μ > 16 for an increase from an initial mean of 16 and H0: μ ≤ 16 for the same scenario. note that the 2 hypotheses are still mutually exclusive and still cover all the possibilities. also note that the hypotheses concern the general population, not just the specific sample.

    • the critical region is defined as the area that provides convincing evidence that our null hypothesis is incorrect. for one-tailed tests, we start with a DOSM and determine the appropriate region based on our α, but instead of splitting it between both tails, we place the entire thing in the pertinent tail (the left-hand if expecting a decrease and the right-hand if expecting an increase). thus for α=.05 and a predicted increase, you need z ≥ 1.65.

    • other than these 2 changes, the tests are identical.

    • in the literature, we specify that we used a one-tailed test, eg: “IV affected DV, z=1.80, p<.05, one-tailed.”
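
    • the one- vs two-tailed boundaries for the same α, sketched in python:

```python
# sketch: one-tailed vs two-tailed critical values for the same alpha
from statistics import NormalDist

alpha = 0.05
z_one = NormalDist().inv_cdf(1 - alpha)      # all 5% in the one pertinent tail
z_two = NormalDist().inv_cdf(1 - alpha / 2)  # 2.5% in each tail

print(f"one-tailed: z >= {z_one:.2f}")   # 1.64 (commonly rounded to 1.65)
print(f"two-tailed: |z| >= {z_two:.2f}")  # 1.96
```

    the one-tailed boundary sits closer to the centre, which is exactly why a smaller difference (in the predicted direction) can reach significance.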

  • the critical factor in the decision of whether to accept or reject H0 is the size of the difference between the treated sample and the original population—a large difference is evidence that the treatment worked, while a small difference is not sufficient to say the treatment had any effect.

    • key difference with one- vs two-tailed tests is that the one-tailed test allows you to reject the null hypothesis with a smaller difference so long as the difference is in the correct direction.

    • due to this ^ difference, researchers, although in agreement that these types of tests are distinct, argue about which is ‘better.’ some say the two-tailed test is more rigorous and thus more convincing than a one-tailed test; others say that one-tailed tests are preferable because they’re more sensitive (a relatively small treatment effect may be significant for a one-tailed but not a two-tailed test). so, we generally say that two-tailed tests should be used when there’s no strong directional expectation or when there are 2 competing predictions, while one-tailed tests should be used only when the directional prediction is made before the research is conducted and there is a strong justification for it. you should never fail a two-tailed test and then follow it up with a one-tailed test as an attempt to salvage a significant result for the data!

  • the hypothesis test procedure isn’t perfect!!

    • demonstrating a significant treatment effect does not necessarily mean demonstrating a substantial treatment effect!! indeed, saying it’s statistically significant doesn’t even provide us any real info about the absolute size of the treatment effect, just that the number obtained is very unlikely without a treatment effect. this conclusion relies on a relative comparison (ie, the size of the treatment effect is considered relative to the standard error). with a small standard error, even a relatively small treatment effect is deemed significant; thus a significant effect does not always mean a big effect.

      • also, if the sample size is large enough, any treatment effect, no matter how small, can be enough for us to reject the null hypothesis!

  • due to this ^^ , researchers are supposed to report the effect size when they report a statistically significant result! there are a few different ways to measure and report effect size, which is intended to provide a measurement of the absolute magnitude of a treatment effect, independent of the size of the sample(s) being used:

    • Cohen’s d is one of the simplest and most direct methods of measuring effect size; it measures the mean difference in terms of standard deviation. it is found using the equation d = (μtreatment − μno treatment)/σ. for the z-score hypothesis test, we use the mean of the untreated population and then substitute the treated sample’s mean for the treated population’s mean [since we don’t have that number]; thus the actual calculations are an estimate of Cohen’s d: estimated d = (Mtreatment − μno treatment)/σ. also note that the standard deviation is included in the equation to standardise the size of the mean difference, similar to how z-scores standardise locations in a distribution.

      • this means that Cohen’s d also expresses the change in terms of standard deviation: d=0.5 means the treatment shifts the scores by half a standard deviation, eg!

      • note that this, unlike the z-score, doesn’t consider sample size, meaning bigger samples aren’t automatically going to seem more convincing than smaller samples!

      • Cohen outlined that a weak effect is 0.2, a moderate effect is 0.5, and a strong effect is 0.8.
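
      • the estimated Cohen’s d as code (the numbers are hypothetical):

```python
# sketch: estimated Cohen's d = (M_treatment - mu_no_treatment) / sigma
def cohens_d_estimate(M_treatment, mu_no_treatment, sigma):
    return (M_treatment - mu_no_treatment) / sigma

d = cohens_d_estimate(M_treatment=85, mu_no_treatment=80, sigma=10)
print(d)  # 0.5 -> a moderate effect by Cohen's criteria
```

      note the sample size n appears nowhere in the formula, which is the whole point.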

    • the power of a statistical test is the probability that the test will correctly reject a false null hypothesis, ie the probability that the test will identify a treatment effect if one really exists. since there are only two outcomes (reject false H0 or fail to reject false H0 (ie, Type II error)), they must add up to 1.00 (100%), and we’ve already IDed the probability of a Type II error as β, meaning that the power of the test (ie, the probability that we will correctly reject a false H0) is power = 1 − β.

      • we typically calculate power as a means of determining whether or not a research study is likely to be successful, so we usually calculate it before the study even begins! to calculate power, though, we first need to make assumptions about a variety of the factors that influence the outcome of a hypothesis test, such as sample size, the size of the treatment effect, and the value chosen for the alpha level.

        • eg: if we take samples of 4 from a population with a mean of 80 and a standard deviation of 10 and we expect the treatment to add 8 points to each individual’s score, we can then calculate standard error (standard deviation over square root of sample size) to be 5. using α=.05, we determine the critical boundaries to be z=±1.96; since we’re expecting an increase, we focus on the right-hand area. when we plot the DOSM and the DOSM with the 8-point effect on the same line graph, we see that about ⅓ of the DOSM with the 8-point effect is beyond the z=1.96 mark, meaning that about ⅓ of the time, we should get a sample mean that leads us to reject H0. we can back this mathematically: the product of the critical z-score (1.96) and the standard error (5) is 9.80 points, so the right-hand boundary of the critical region sits at 80 + 9.80 = 89.80. the treated DOSM is centred at 88 (the original mean plus the 8-point effect), so relative to it, the boundary has a z-score of (89.80 − 88)/5 = 0.36. using our table, we find that this corresponds to p=0.3594, or 35.94% of sample means in the tail beyond the boundary. thus, about 36% of the time, we will reject our null hypothesis, and the study’s power is about 36%. this means that the study has a relatively small chance of being successful (about 1 in 3) with a sample of 4 people.
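
        • the same power computation as code, reproducing the example’s numbers:

```python
# sketch: power for the worked example
# (mu=80, sigma=10, n=4, +8-point effect, two-tailed alpha=.05)
import math
from statistics import NormalDist

mu, sigma, n, effect, alpha = 80, 10, 4, 8, 0.05
se = sigma / math.sqrt(n)                                 # 5.0
boundary = mu + NormalDist().inv_cdf(1 - alpha / 2) * se  # 80 + 1.96*5 = 89.80
# proportion of the treated DOSM (centred at 88) beyond the right-hand boundary
power = 1 - NormalDist().cdf((boundary - (mu + effect)) / se)
print(round(power, 4))  # ≈ 0.3594
```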

      • obviously, power and effect size are related—eg, we just found that with an effect of +8, the power is 35.94%. in that same situation, if we doubled the effect to +16, we would find that about 90% of the treated DOSM lies beyond the critical boundary, meaning the study’s power is about 90%. the larger the effect, the larger the power!

      • also note that the power [of a hypothesis test] is not supposed to be a pure measurement of effect size and is directly influenced by:

        • sample size (since it changes standard error)—generally, larger sample size = greater power

          • we can also use power to determine the appropriate sample size for a study—what size sample will give us a reasonable probability for a successful research study? if the power is too low when computed with one sample size, increase the size and see if that fixes the problem!

        • alpha level—reducing the alpha level also reduces power by pushing the critical z-score out further, meaning less likelihood of a value falling at or beyond it even with very profound treatment effects

        • one-tailed vs two-tailed—changing the test from two-tailed to one-tailed increases the power of the test because the critical region in the predicted tail is larger in one-tailed tests (since the whole α sits in that one tail rather than being split evenly between the left and right tails).
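
          • wrapping the power computation in a function makes the sample-size point concrete, using the +8-point scenario from the example earlier (μ=80, σ=10, two-tailed α=.05):

```python
# sketch: power as a function of sample size for the +8-point effect scenario
import math
from statistics import NormalDist

def power_for_n(n, mu=80, sigma=10, effect=8, alpha=0.05):
    se = sigma / math.sqrt(n)
    boundary = mu + NormalDist().inv_cdf(1 - alpha / 2) * se
    # proportion of the treated DOSM beyond the right-hand boundary
    return 1 - NormalDist().cdf((boundary - (mu + effect)) / se)

for n in (4, 9, 16, 25):
    print(n, round(power_for_n(n), 3))  # power climbs steadily as n grows
```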