Chapter 5: Hypothesis Testing and Statistical Significance
Due to sampling error, the samples we select might not be a true reflection of the underlying population.
One of the problems we face when conducting research is that we do not know the pattern of scores in the underlying population. In fact, our reason for conducting the research in the first place is to try to establish the pattern in the underlying population. We are trying to draw conclusions about the populations from our samples.
p-value: the probability of obtaining the pattern of results we found in our study if there were no relationship between the variables in which we were interested in the population
The p-value is a conditional probability.
Hypothesis testing is often seen as a competition between two hypotheses: our research hypothesis and the null hypothesis.
Null Hypothesis: always states that there is no effect in the underlying population; by effect we mean a relationship between two or more variables, a difference between two or more different populations or a difference in the responses of one population under two or more different conditions
Research Hypothesis: our prediction of how two variables might be related to each other; alternatively, it might be our prediction of how specified groups of participants might be different from each other or how one group of participants might be different when performing under two or more conditions
The research hypothesis is often called the experimental or alternative hypothesis.
If the researcher reports that the null hypothesis could not be rejected, this simply indicates that, given the probability they calculated, the null hypothesis remained the more sensible conclusion.
If the researcher rejects the null hypothesis, it means that the probability of obtaining their findings if the null hypothesis were true is so small that it makes more sense to believe in the research hypothesis.
If there is no real relationship in the population, you are unlikely to find a relationship in your randomly selected sample. Therefore, if you do find a relationship in your sample, it is likely to reflect a relationship in your population.
Alpha (α): the criterion for statistical significance that we set for our analyses; it is the probability level that we use as a cut-off below which we are happy to assume that our pattern of results is so unlikely as to render our research hypothesis as more plausible than the null hypothesis
On the assumption that the null hypothesis is true, if the probability of obtaining an effect due to sampling error is less than 5%, then the findings are said to be 'significant'. If this probability is greater than 5%, then the findings are said to be 'nonsignificant'.
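To make this decision rule concrete, here is a minimal sketch in Python (using NumPy and SciPy, with invented data and group names) of comparing a calculated p-value against alpha for an independent-groups t-test:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical scores for two groups (e.g., treatment vs. control)
group_a = rng.normal(loc=52, scale=10, size=30)
group_b = rng.normal(loc=47, scale=10, size=30)

alpha = 0.05                                   # criterion for significance
t_stat, p_value = stats.ttest_ind(group_a, group_b)

if p_value < alpha:
    print(f"p = {p_value:.3f} < {alpha}: significant, reject the null hypothesis")
else:
    print(f"p = {p_value:.3f} >= {alpha}: nonsignificant, do not reject the null hypothesis")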
Statistically Significant: our findings when we find that our pattern of research results is so unlikely as to suggest that our research hypothesis is more plausible than the null hypothesis
Not Significant: our findings when we find that our pattern of data is highly probable if the null hypothesis were true
Just because a statistically significant difference is found between two samples of scores, it does not mean that it is necessarily a large or psychologically significant difference.
The probability we calculate in inferential statistics is simply the probability that such an effect would arise if there were no difference between the underlying populations. This does not necessarily have any bearing on the psychological importance of the finding. The psychological importance of a finding will be related to the research question and the theoretical basis of the research.
Statistical significance does not equal psychological significance.
It is important to understand that the p-value is a conditional probability: you are assessing the probability of an event's occurrence, given that the null hypothesis is true.
Alpha simply gives an indication of the likelihood of finding such a relationship if the null hypothesis were true. It is perhaps true that the stronger the relationship, the lower the probability that such a relationship would be found if the null hypothesis were true, but this is not necessarily so.
Alpha is the probability that we will get a relationship of the obtained magnitude if the null hypothesis were true. It is not the probability of the null hypothesis being true.
Converting the data from our samples into scores from probability distributions enables us to work out the probability of obtaining such data by chance factors alone. We can then use this probability to decide which of the null and experimental hypotheses is the more sensible conclusion. It should be emphasized here that these probabilities we calculate are based upon the assumption that our samples are randomly selected from the population.
If we were investigating differences between groups we could use probability distributions to find out the probability of finding differences of the size we observe by chance factors alone if the null hypothesis were true. In such a case, we would convert the difference between the two groups of the independent variable into a score from a probability distribution. We could then find out the probability of obtaining such a score by sampling error if no difference existed in the population.
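As a sketch of this conversion (invented numbers; the equal-variance t formula is used here by assumption), the difference between two group means can be expressed as a t-score and its probability read off the t distribution:

import numpy as np
from scipy import stats

# Hypothetical data for two independent groups
a = np.array([12, 15, 11, 14, 13, 16, 12, 15])
b = np.array([10, 11,  9, 12, 10, 13, 11, 10])

diff = a.mean() - b.mean()

# Pooled standard error of the difference (equal-variance formula)
n1, n2 = len(a), len(b)
sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))

t_score = diff / se                               # the difference expressed as a t-score
df = n1 + n2 - 2
p_two_tailed = 2 * stats.t.sf(abs(t_score), df)   # probability under the null hypothesis

print(f"difference = {diff:.2f}, t = {t_score:.2f}, p = {p_two_tailed:.4f}")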
Type I Error: where you decide to reject the null hypothesis when it is in fact true in the underlying population; you conclude that there is an effect in the population when no such effect really exists
If your α is 5% (0.05), then whenever the null hypothesis is true you will have a 1 in 20 chance of making a Type I error. This is because the p-value is the probability of obtaining the observed effect, given that the null hypothesis is true; it is the probability of obtaining an effect as a result of sampling error alone, and we reject the null hypothesis whenever this probability falls below α.
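One way to see this 1-in-20 property is with a small simulation, a sketch in which the null hypothesis is true by construction (both samples are drawn from the same population). Over many repeated 'experiments', about 5% of p-values fall below 0.05:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_experiments = 10_000

false_positives = 0
for _ in range(n_experiments):
    # Both samples come from the SAME population, so the null hypothesis is true
    a = rng.normal(loc=50, scale=10, size=25)
    b = rng.normal(loc=50, scale=10, size=25)
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        false_positives += 1          # a Type I error

print(f"Type I error rate: {false_positives / n_experiments:.3f}  (expected about {alpha})")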
Replication is one of the cornerstones of science.
If you observe a phenomenon once, it may be a chance occurrence; if you see it on two, three, four or more occasions, you can be more certain that it is a genuine phenomenon.
Type II Error: where you conclude that there is no effect in the population when in reality there is an effect in the population; it represents the case when you do not reject the null hypothesis when in fact you should do because in the underlying population the null hypothesis is not true
The probability of making a Type II error is denoted as beta (β).
If we set α at 0.2, we would be tolerating a Type I error in one case in every five. In one case in every five we would reject the null hypothesis when it is in fact true.
On the positive side, we would be much less likely to make a Type II error.
If we set α at 0.001, we are much less likely to make a Type I error: we would reject a true null hypothesis only once in every thousand times. On the face of it, this would appear to be a very good thing. The problem is that, although we reduce the probability of making a Type I error, we also increase the probability of not rejecting the null hypothesis when it is false. That is, we increase the probability of making a Type II error.
In most situations an α of 0.05 provides a balance between making Type I and Type II errors.
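The trade-off can also be demonstrated by simulation. In this sketch a real effect exists (the population means genuinely differ by five points), so every failure to reject is a Type II error; note how β rises as α is made stricter. The means, SDs and sample sizes are invented for illustration:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments = 5_000

# A real effect exists: the population means differ by 5 points
p_values = []
for _ in range(n_experiments):
    a = rng.normal(loc=55, scale=10, size=20)
    b = rng.normal(loc=50, scale=10, size=20)
    p_values.append(stats.ttest_ind(a, b).pvalue)
p_values = np.array(p_values)

for alpha in (0.2, 0.05, 0.001):
    beta = np.mean(p_values >= alpha)   # failing to reject when the null is false
    print(f"alpha = {alpha:<6} Type II error rate (beta) = {beta:.3f}")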
One-tailed Hypothesis: one where you have specified the direction of the relationship between variables or the difference between two conditions; also called a directional hypothesis
Two-tailed Hypothesis: one where you have predicted that there will be a relationship between variables or a difference between conditions, but you have not predicted the direction of the relationship between the variables or the difference between the conditions; also called a bi-directional hypothesis
If you make a two-tailed prediction, the calculated score can fall in either tail. If we use a 5% significance level as our cut-off for rejecting the null hypothesis, we reject only for calculated scores that have a 2.5% probability of being obtained in each tail; that is, the 5% is divided between the two tails.
If we make a one-tailed prediction, we accept scores in only one of the tails and therefore our 5% probability region is all in the one tail; that is, it is not divided between the two tails.
Only the p-value is affected by the one-tailed versus two-tailed distinction. The test statistic (e.g., correlation coefficient or t-value) remains the same for both one- and two-tailed tests on the same set of data.
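This is easy to verify. The sketch below (invented data; the `alternative` argument requires a reasonably recent version of SciPy) runs the same t-test two-tailed and one-tailed: the t-value is identical, and the one-tailed p is half the two-tailed p when the effect lies in the predicted direction:

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a = rng.normal(loc=53, scale=10, size=30)   # predicted to score higher
b = rng.normal(loc=50, scale=10, size=30)

two = stats.ttest_ind(a, b, alternative="two-sided")
one = stats.ttest_ind(a, b, alternative="greater")   # one-tailed: a > b

print(f"two-tailed: t = {two.statistic:.3f}, p = {two.pvalue:.4f}")
print(f"one-tailed: t = {one.statistic:.3f}, p = {one.pvalue:.4f}")
# Same t in both cases; the one-tailed p is half the two-tailed p
# (provided the observed difference lies in the predicted tail).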
When making a two-tailed prediction about differences between two conditions, we have only to specify that a difference exists between them. We do not specify which condition will have the higher scores.
If we make a one-tailed prediction, we would predict which of the above scenarios is most appropriate: that is, which condition will have the higher scores.
Many statistical tests that we use require that our data have certain characteristics. These characteristics are called assumptions.
Many statistical tests are based upon the estimation of certain parameters relating to the underlying populations in which we are interested. These sorts of tests are called parametric tests. They assume that our samples come from populations that follow particular probability distributions, such as the normal distribution.
Non-Parametric or Distribution-Free Tests: where statistical tests do not make assumptions about the underlying distributions or estimate the particular population parameters
The scale upon which we measure the outcome or dependent variable should be at least interval level. This assumption means that any dependent variables that we have should be measured on an interval- or ratio-level scale or, if we are interested in relationships between variables, the variables of interest need to be measured using either interval- or ratio-level scales of measurement.
The populations from which the samples are drawn should be normally distributed. Parametric tests assume that we are dealing with normally distributed data. Essentially this assumption means that we should always check that the data from our samples are roughly normally distributed before deciding to use parametric tests. We have already told you how to do this using box plots, histograms or stem-and-leaf plots. If you find that you have a large violation of this assumption, there are ways to transform your data legitimately so that you can still make use of parametric tests. For example, if you have positively skewed data you can transform all the scores in your skewed variable by calculating the square root of each score. Doing this can eliminate positive skew and leave your variable much more normally distributed. Some students think that this is simply changing your data and so cheating. However, this is not the case. All you are doing is converting the variable to a different scale of measurement. It is akin to converting temperature scores from Centigrade to Fahrenheit. As you are applying the same transformation to all scores on the variable, it is entirely legitimate.
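Here is a sketch of the square-root transformation in action. The data are simulated (an exponential distribution is assumed here purely to produce positive skew, as in reaction-time data), and `scipy.stats.skew` measures the skewness before and after:

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Simulate positively skewed scores (e.g., reaction times)
scores = rng.exponential(scale=200, size=500)

transformed = np.sqrt(scores)   # the same transformation applied to every score

print(f"skewness before: {stats.skew(scores):.2f}")
print(f"skewness after : {stats.skew(transformed):.2f}")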
The third assumption that we cover here is only relevant for designs where you are looking at differences between conditions. This assumption is that the variances of the populations should be approximately equal. This is sometimes referred to as the assumption of homogeneity of variances. We informed you that the standard deviation is the square root of the variance. In practice, we cannot check to see if our populations have equal variances and so we have to be satisfied with ensuring that the variances of our samples are approximately equal. You might ask: what do you mean by approximately equal? The general rule of thumb for this is that, as long as the largest variance that you are testing is not more than three times the smallest, we have roughly equal variances. Generally, a violation of this assumption is not considered to be too catastrophic as long as you have equal numbers of participants in each condition. If you have unequal sample sizes and a violation of the assumption of homogeneity of variance, you should definitely use a distribution-free test.
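A quick sketch of the rule-of-thumb check with invented data; Levene's test (not covered above, but available as `scipy.stats.levene`) is one commonly used formal check of the same assumption:

import numpy as np
from scipy import stats

a = np.array([12, 15, 11, 14, 13, 16, 12, 15])
b = np.array([10, 18,  6, 14,  9, 17,  8, 16])

var_a, var_b = a.var(ddof=1), b.var(ddof=1)
ratio = max(var_a, var_b) / min(var_a, var_b)
print(f"variance ratio = {ratio:.2f}  (rule of thumb: worry above 3)")

# A formal check: a nonsignificant p suggests roughly equal variances
stat, p = stats.levene(a, b)
print(f"Levene's test: W = {stat:.2f}, p = {p:.3f}")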
The final assumption is that we have no extreme scores. The reason for this assumption is easy to understand when you consider that many parametric tests involve the calculation of the mean as a measure of central tendency. If extreme scores distort the mean, it follows that any parametric test that uses the mean will also be distorted. We thus need to ensure that we do not have extreme scores.
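One common screening rule, consistent with the box plots mentioned earlier, is the 1.5 × IQR fence; this is a sketch with an invented data set in which one score is deliberately extreme:

import numpy as np

scores = np.array([12, 14, 13, 15, 11, 14, 13, 12, 45])   # 45 looks extreme

q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr             # the boxplot 'fences'

outliers = scores[(scores < lower) | (scores > upper)]
print(f"fences: [{lower:.1f}, {upper:.1f}], extreme scores: {outliers}")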
Parametric tests are used very often in psychological research because they are more powerful tests. That is, if there is a difference in your populations, or a relationship between two variables, the parametric tests are more likely to find it, provided that the assumptions for their use are met.
Parametric tests are more powerful because they use more of the information from your data. Their formulae involve the calculation of means, standard deviations, and some measure of error variance.
Distribution-free or non-parametric tests are based upon the rankings or frequency of occurrence of your data rather than the actual data themselves.
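For example, the Mann-Whitney U test, a rank-based alternative to the independent t-test, compares groups using only the rank order of the scores. A sketch with invented data:

import numpy as np
from scipy import stats

a = np.array([12, 15, 11, 14, 13, 16, 12, 15])
b = np.array([10, 11,  9, 12, 10, 13, 11, 10])

# The test uses the ranks of the scores, not the scores themselves
u_stat, p = stats.mannwhitneyu(a, b, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat}, p = {p:.4f}")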
Because of their greater power, parametric tests are preferred whenever the assumptions have not been grossly violated.