Inferential statistics allows us to make claims about cause-and-effect relationships, especially in experimental studies using random assignment.
These claims rely on statistical assumptions.
This lecture unit focuses on statistical assumptions and applications for making statistical claims about cause and effect.
Key topics include probability theory, hypothesis testing, and effect sizes.
Probability is the likelihood of a particular outcome, influencing our daily decisions.
Inferential statistics uses probability to generalize sample results to a larger population.
Two inferential techniques that will be covered are the single sample z-test and the single sample t-test, with examples of their use in experimental designs.
Probability
We make decisions daily based on the likelihood of events.
Example: Probability of winning the lottery (USA: 1 in 175,711,536 before 2012).
Probability depends on the number of participants; calculated as the ratio of prizes to people.
Probability calculation relies on mathematical assumptions and distribution properties.
Statistical tests often assume a normal distribution, which requires a large number of scores distributed symmetrically around the mean.
In a normal distribution, the mean, median, and mode are at the center.
Each point on the normal distribution has a specific probability value.
The normal curve is continuous, so the probability associated with any single point is vanishingly small; probabilities are instead obtained from areas under the curve.
For discrete distributions, probabilities are assigned to specific points, ranging from 0 to 100%.
Probability theory is used to make scientific or statistical decisions.
Inferential statistics uses probability to assess the commonality or unusualness of an event.
In psychology, probability models test a sample and generalize results to a larger population.
Hypothesis Testing
The research hypothesis (or alternative hypothesis) is what we hope to be true.
It's used practically to describe our research ideas in writing.
Two types of hypotheses are used in the probability model: the research hypothesis and the null hypothesis.
The null hypothesis is an underlying assumption of statistical testing that is not usually articulated.
Probability theory relies on unusual events.
Hypothesis testing specifies a point at which an outcome is so unlikely it probably belongs to a different population.
Because hypothesis testing is based on probabilities, findings are never certain.
Instead, we rely on a preponderance of evidence.
Replicating unlikely results provides additional evidence of an effect.
Scientists collect evidence of an effect rather than proving events.
Repeated unusual outcomes suggest a condition of the independent variable (IV) is producing these outcomes.
The null hypothesis is a statement that a treatment has no effect. It's used only for statistical testing.
Alternative and null hypotheses can be written as follows:
Alternative Hypothesis: H1: μ1 ≠ μ2
Null Hypothesis: H0: μ1 = μ2
Hypothesis Testing (Continued)
The alternative hypothesis indicates that two population means are not equal (what we hope to find).
The null hypothesis indicates population means are equal (used for statistical testing).
Conceptually, the null hypothesis is used to conduct inferential statistics.
We test if groups differ by establishing a threshold (alpha) for rejecting the null hypothesis.
Alpha (α) is a probability value, typically set to 0.05, based on the normal distribution.
There is a 5% chance (alpha) of declaring a statistically significant difference when none exists.
The null hypothesis is rejected when a statistic falls in the very small probability area of the distribution, indicating statistically significant results.
Rejecting the null hypothesis suggests the IV had an effect because the results are unlikely if the IV had no effect.
A significant difference suggests the treatment group mean is disparate enough from the control group mean to infer that the two means came from different populations.
A non-significant difference indicates the treatment group scores may have come from the same population, meaning the null hypothesis cannot be rejected.
Decisions in Hypothesis Testing
Decisions are based on probability when rejecting a null hypothesis.
We can also decide not to reject the null hypothesis if the data did not fall beyond the rejection threshold (alpha).
In this case, we cannot confidently state that the means were different enough to conclude they came from different populations.
We conclude that no difference between the sample means was detected.
Hypothesis testing involves deciding to either reject or not reject the null hypothesis based on the detection of an effect.
It's crucial to note that we don't "accept the null hypothesis" because we are not trying to prove there is no effect, but rather looking for enough change in data to pass the probability threshold.
Even when this decision is made, it can be wrong.
Decisions in Statistical Hypothesis Testing
Two decisions are possible: (a) reject the null hypothesis, or (b) do not reject the null hypothesis.
Either decision can be a mistake due to the inherent error in probability estimates.
The amount of error is related to sample size; smaller samples have larger errors, while larger samples have less error.
Despite large samples, some error will always be present, meaning we can never be absolutely certain about outcomes unless we measure the entire population.
Rejecting the Null Hypothesis
Determining if rejecting the null hypothesis is correct requires knowing the actual state of affairs.
However, we rarely know if a treatment has a real effect since we're always working with samples.
Sample Representation
Using a sample means dealing with a subset; therefore, our sample may not accurately represent the population.
In inferential statistics, decisions are based on the likelihood of an event or the theory of probability.
We make an educated guess based on the sample because we lack complete information.
Rejecting the null hypothesis asserts that results are unlikely if the treatment had no effect, but this decision can be incorrect.
A Type I error (false positive) occurs when we conclude a treatment had an effect, but it really did not (labeled as alpha α = 0.05).
Alpha is the predetermined rejection region and the likelihood of a false positive.
Example: A metal detector sounding an alarm when there is no weapon is a Type I error.
Despite statistical significance, results may not be accurate due to probability.
Type I Error
If we make a Type I error, we decide there is an effect, but we are wrong because no actual effect is present.
If we reject the null hypothesis correctly, an effect is present, and we make a correct decision.
'Statistical Power' is the likelihood of correctly detecting an effect.
Power is the probability that our decision to reject the null hypothesis is correct.
Power is the likelihood of accurately detecting an effect.
Not Rejecting the Null Hypothesis
When we decide not to reject the null hypothesis, we assert that our results are inconclusive and the treatment probably did not have an effect.
At the very least, we conclude we don't have enough information to decide that there is an effect.
We are correct in this decision if there is no effect.
If the sample statistic doesn't fall beyond the level of alpha, and we conclude that no effect is present, this decision is accurate if the treatment, in reality, did not have an effect.
For example, if dogs in a treatment group did not learn to sit (although we hypothesized that they would), then we correctly conclude that the operant conditioning treatment is not effective.
Type II Error
Our decision that there was not an effect can be wrong.
If operant conditioning is truly effective, but the dogs didn't learn to sit, we failed to detect the effect (false negative).
A Type II error is defined as not rejecting the null hypothesis when we should have, meaning the hypothesis is really false.
In this case, we did not detect a statistically significant difference when one was present.
Another way to describe a Type II error is as a false negative.
A Type II error or beta (β) is not rejecting the null hypothesis when the null hypothesis is false.
If we knew the true state of affairs, we would know we erroneously concluded there was not an effect when one was present.
A Type II error may occur because statistical power is low.
Factors Influencing Power
One way to increase power is to increase sensitivity of measurement.
Using an instrument sensitive to changes in the DV increases the likelihood of accurately detecting an effect.
E.g., airport workers can calibrate a metal detector to be more or less sensitive.
A second way to increase power is to increase the number of people or, in the case of our dog training example, the number of dogs in our sample.
Increasing the sample size decreases the size of the error in our calculations.
A smaller error, or standard error of the mean (SEM), increases our ability to accurately detect an effect.
Effect Size
Power is also affected by effect size (amount of change in the DV).
Effect size reflects the estimated standardized difference between two means, or the strength of the relationship between the independent and dependent variables; it is calculated for each inferential test.
Much like alpha, we typically specify effect size in advance: we state the smallest amount of change that we consider to be important (e.g., the smallest weight increment that a metal detector must register).
A large effect size means the standardized difference between means is bigger, which increases the calculated value of power; if we believe a treatment has a very small effect, the change in the DV will be difficult to detect, so a small effect size results in lower power.
Effect size is a calculation determining the difference in measures between the treatment and control groups relative to the standard deviation for one or both groups.
Normal Distribution
The standardized normal distribution (sometimes called the z-distribution) contains scores that are associated with standard areas within the normal distribution.
Most scores in a normal distribution are grouped near the middle, with very few scores located in the two ends of the distribution.
Because the normal distribution is broken down into standard areas, it is convenient to calculate percentiles.
Using Normal Distribution to Calculate Percentiles
We can use the normal distribution to determine the percentage of people above or below any given point.
These percentages serve as estimates of probability.
A standard percentage of the normal distribution is contained in the area between the mean and one standard deviation away from the mean.
The actual amount of area is 0.3413, or approximately 34% of the distribution is located between the mean and the first standard deviation.
The normal curve is symmetrical, so the same proportion of the distribution is located between the mean and one standard deviation below the mean.
You will notice that as we move further away from the middle of the distribution, the area under the curve becomes smaller.
This smaller area corresponds to a lower percentage of the distribution and lower probability.
Exact probability values have been calculated for each point in the normal distribution, and the probabilities are conveniently reported in a unit normal table.
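These standard areas can also be checked without a unit normal table. The sketch below uses Python's standard library (it assumes Python 3.8+, which provides statistics.NormalDist) to reproduce the values discussed above:

```python
# Check standard normal areas with the standard library
# (an alternative to looking values up in a unit normal table).
from statistics import NormalDist

z = NormalDist()  # standard normal distribution: mean 0, sd 1

# Area between the mean (z = 0) and one standard deviation (z = 1)
area = z.cdf(1) - z.cdf(0)
print(round(area, 4))  # 0.3413

# Area in the upper tail beyond z = 2.0
tail = 1 - z.cdf(2.0)
print(round(tail, 4))  # 0.0228
```

Note that the area between the mean and one standard deviation matches the 0.3413 figure given above.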
In order to calculate an exact probability for an individual score, we must first calculate a z-score.
z = (X − μ) / σ, where X = individual score; μ = population mean; and σ = population standard deviation.
Once we calculate the z-score, we can use it to compare an individual score to a population mean.
In other words, we can gauge how likely or common a particular score is in the population.
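As a quick illustration (not from the lecture), consider an IQ score, since IQ tests are commonly standardised with μ = 100 and σ = 15:

```python
# Hypothetical example: z-score for an IQ score of 130, assuming the
# commonly used standardised parameters mu = 100 and sigma = 15.
from statistics import NormalDist  # requires Python 3.8+

X, mu, sigma = 130, 100, 15
z = (X - mu) / sigma           # z = (X - mu) / sigma
print(z)                       # 2.0

# Proportion of the population scoring at or below 130 (the percentile)
p = NormalDist().cdf(z)
print(round(p, 4))             # 0.9772
```

A score of 130 sits two standard deviations above the mean, so it is higher than roughly 98% of the population.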
Sampling Distributions
Sampling forms the basis for all inferential testing.
We sample or collect data on a dependent variable and then we compare the sample data to the theoretical population distribution.
We can calculate the probability that our sample comes from the larger population.
This idea of comparing a sample mean to a population mean forms the basis of hypothesis testing.
However, in order to understand hypothesis testing, we must consider many types of sampling distributions.
When we collect data from a group of people who we think represent the larger population, we might want to compare the mean of that sample or group to the population mean.
In making this comparison we are theoretically using all samples of a particular size (e. g., N =5) and considering where our sample falls with respect to the population of means.
Therefore, instead of using the normal distribution to estimate the likelihood of obtaining an individual score as compared to the population mean, we use the normal distribution to estimate the likelihood of obtaining a sample mean value as compared to the population of mean values.
In other words, a sampling distribution contains many samples instead of only individual scores.
Use of a sampling distribution allows us to apply the principles of probability to a sample rather than to an individual.
These probability estimates help us determine if a sample mean is highly unlikely.
Using probability to estimate the likelihood of a sample mean is another way to say that we are using inferential statistics, and we use inferential techniques to test hypotheses.
When we use a normal distribution, we rely on the mathematical assumptions that allow us to take samples and draw conclusions or calculate probabilities.
Sampling Distributions (Continued)
However, not all distributions are normal and the question is how can we calculate probabilities from samples if the data are not normally distributed.
One answer to this question is to obtain reasonably large samples.
If we use large samples, we can then rely on another mathematical assumption, the central limit theorem.
The central limit theorem states that as our sample gets larger, the sampling distribution will be approximately normal.
The central limit theorem shows that as the sample size gets larger, the shape of the distribution approaches normality and variability can be standardised, that is, represented by the standard normal distribution.
The central limit theorem also allows us to assume that with a large enough sample, we can use the normal distribution to calculate probability estimates.
In addition to having a standard way of describing variability (the standard deviation is the standardised difference between the mean and an individual score – how far an individual score is from the mean), the central limit theorem allows us to break the distribution into standard increments or percentiles.
We can convert any sample mean into a standard score and place it within the normal distribution.
Thus, we can calculate probabilities, or the likelihood of an event within a population, even when we don’t have all of the data.
Very simply, if the sample size is large enough, we can use the normal distribution to obtain probabilities.
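The central limit theorem can be sketched with a small simulation (an illustration, not from the lecture): draw many samples from a clearly non-normal distribution (here, uniform) and examine the distribution of their means.

```python
# Sketch of the central limit theorem: means of samples drawn from a
# non-normal (uniform) distribution pile up around the population mean,
# and the pile of means looks increasingly normal as N grows.
import random
import statistics

random.seed(0)  # make the simulation repeatable

population_mean = 0.5  # mean of the uniform(0, 1) distribution
sample_means = [
    statistics.mean(random.random() for _ in range(30))  # N = 30 per sample
    for _ in range(10_000)
]

# The mean of the sampling distribution is close to the population mean
print(round(statistics.mean(sample_means), 2))  # ~0.5
```

Even though the individual scores come from a flat distribution, the 10,000 sample means cluster tightly and symmetrically around 0.5, which is what lets us apply normal-distribution probabilities to sample means.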
Theoretical Sampling Distributions
Theoretically, each sampling distribution (not a distribution of individual scores, but rather a distribution of means) contains all possible samples for a particular size.
E. g., if we specify a sample size of 5, our sampling distribution comprises all possible samples of 5 scores.
If you take the average of all possible combinations of sample means (for a particular sample size), the obtained value equals the population mean.
We use the assumptions of normality and large sample size to calculate percentiles for sample means based on the population mean.
Standard Error of the Mean
We use the standard deviation for calculating percentile rankings for individual scores, but we need to use a slightly different measure of variability for sampling distributions
When we use sampling distributions, instead of individual scores, we need a measure of variability that is comparable to the standard deviation
The Standard Error of the Mean (SEM) provides an index of the variability of sampling means instead of individual scores
The SEM allows us to estimate a range of expected differences between a sample mean and a population mean
We can use the SEM to calculate the percentile for a mean value, rather than an individual score
We use the SEM to describe how far a sample mean is from a population mean
Calculation of the SEM is: σM = σ / √N, where σ is the population standard deviation
The SEM is affected by the size of the sample
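The effect of sample size on the SEM can be seen directly (a sketch using the GRE population standard deviation of 100 that appears later in this unit):

```python
# SEM = sigma / sqrt(N): larger samples give a smaller standard error,
# hence more precise estimates of the population mean.
import math

sigma = 100  # population standard deviation (GRE example from the text)

sems = {N: sigma / math.sqrt(N) for N in (25, 100, 400)}
for N, sem in sems.items():
    print(N, sem)  # 25 -> 20.0, 100 -> 10.0, 400 -> 5.0
```

Quadrupling the sample size halves the SEM, because N enters the formula under a square root.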
Single Sample z-test
We can use the normal distribution and the theory of probability to compare a sample mean to a population mean
A single sample z-test is used when we know both the population mean (μ) and the population standard deviation (σ)
Examples of data that can be used with the single sample z-test include IQ scores, GRE scores, SAT scores, height and weight
The single sample z-test allows us to compare a sample mean to a population mean
Let’s consider an example: testing whether a GRE preparation course is useful
Here, we can compare the average GRE score from students completing the preparation course to the GRE population mean score from the standardised test
We can use the z-test because we know the population mean (μ = 500) and standard deviation (σ = 100)
We rely on the probability model to determine if our sample mean (or students taking the GRE preparation course) is significantly different from the population mean
Suppose we offer the preparation course to 25 students and find that after completing the course, the sample mean is 540
Next, there are two important questions to ask: (i) is the difference statistically significant, and (ii) is the difference meaningful
Significance Testing with the z-test
The z-test (not a score) allows us to compare a sample mean to a population mean for the purpose of determining the likelihood that the sample mean comes from the same population
The formula for the z-test is z = (M − μ) / σM
As you can see from the z-formula, we need to calculate the standard error of the mean (σM) before we can compute the value of z: σM = σ / √N = 100 / √25 = 20
The z-test for the sample of 25 students completing the GRE preparation course is z = (M − μ) / σM = (540 − 500) / 20 = 2.0
What does this z-value really indicate? Does the difference between the sample and population mean suggest, or infer, a difference that is likely to have occurred because of the preparation course?
We compare our calculated z test value to a z table of probabilities
The z table or Unit Normal table lists all values of z between 0 and 4.00
The above calculated z value of 2.0 suggests that approximately 2% of the distribution is located beyond this point
In other words, we would end up with a mean as high as 540 only about 2% of the time
This means, a group that had not taken a preparation test would have a mean as high as 540 only about 2% of the time
Thus, 98% of the time, such a group would have a mean score below 540
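The whole GRE calculation can be verified in a few lines of Python (assuming Python 3.8+ for statistics.NormalDist; the exact upper-tail probability is about 2.3%, which the lecture rounds to 2%):

```python
# Worked z-test for the GRE example in the text:
# mu = 500, sigma = 100, N = 25, sample mean M = 540.
import math
from statistics import NormalDist

mu, sigma, N, M = 500, 100, 25, 540

sem = sigma / math.sqrt(N)         # 100 / 5 = 20
z = (M - mu) / sem                 # (540 - 500) / 20 = 2.0
p_upper = 1 - NormalDist().cdf(z)  # upper-tail probability beyond z

print(z)                  # 2.0
print(round(p_upper, 3))  # 0.023 -> the "about 2%" in the text
```

A mean of 540 would therefore occur in only about 2% of untreated samples of this size.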
Conclusion from the z-test: GRE Preparation Group
It is reasonable to conclude that a mean of 540 is unusual enough that the GRE preparation group is not like the original population
In other words, the preparation group comes from a different population that scores higher than 500
Steps in Hypothesis Testing
From a practical standpoint, we do not use the z-test very often
So, we use the z-test primarily to illustrate the concept of hypothesis testing
The z-test not only allows us to use probability to test the hypothesis, but it also provides us with a direct measure of effect size
Regardless of the statistical test that we might use, we follow the same basic procedure for hypothesis testing
Step 1: State the Hypothesis
Using the research or alternative hypothesis, we hypothesize that our sample of students participating in the GRE preparation course is likely to have higher scores
We also state a null or statistical hypothesis, i.e., there would be no difference between the sample and population mean GRE scores
Although we don’t typically use this hypothesis when writing results, we must specify this hypothesis for the purpose of statistical testing
The statistical hypothesis provides the basis for testing whether our GRE preparation course results in an effect that is statistically unlikely to occur in the absence of a treatment
Step 2: Set Alpha
We set alpha to reflect the point at which we conclude that the results are unlikely to occur if the treatment did not have an effect
When we test hypotheses, we must specify what we consider to be an unlikely outcome
In other words, we specify when the sample is so unlikely that we no longer believe that the treatment was ineffective
Researchers typically use the conventional value of .05 or α = .05 and almost always use what we call a two-tailed hypothesis test
When we use a two-tailed test, the 5% or alpha, is split between the two tails of the distribution
The critical value is therefore the z-score associated with 2.5% (.025) in each tail, or z = ±1.96
If we obtain a sample mean located in this alpha area, then we conclude that our sample is unlikely (only a 5% chance) to be a part of the larger untreated population of GRE scores
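The critical value of 1.96 can itself be recovered from the inverse of the normal cumulative distribution (a sketch assuming Python 3.8+ for statistics.NormalDist.inv_cdf):

```python
# Two-tailed critical value for alpha = .05: put .025 in each tail and
# find the z-score that cuts off the upper .025 of the distribution.
from statistics import NormalDist

alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
print(round(z_crit, 2))  # 1.96
```

This is why tables and software agree on ±1.96 as the two-tailed cutoff at α = .05.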
Step 3: Calculate the Statistic
We specify our hypothesis and alpha level prior to calculating a statistic
Then, we calculate the single sample test or z-test to determine if the GRE preparation course affected the GRE scores
If we conclude that the result is statistically unlikely, and we reject the statistical hypothesis, then we have support for our research hypothesis that the course affected the scores
Step 4: Compare calculated to tabled z-score
We determine this by comparing our calculated z-score to the tabled z-score
If our calculated z-score is larger than the tabled critical value (z = 1.96) associated with our alpha level, we conclude that our calculated value is unlikely to have occurred if the GRE preparation group, or sample, was from the same group as the original population, and that our findings are statistically significant
However, statistical significance does not always mean that the effect is practically significant
It is possible to find a statistically unlikely occurrence or statistical significance that is not meaningfully different
E. g., how much change in a GRE score would be necessary in order for you to consider the improvement in the score to be meaningful?
In order to fully understand what amount of change is meaningful, we specify an effect size
Step 5: Calculate and Report Effect Size
Effect size reflects the estimated standardised difference between two or more means.
The calculated z score is our measure of effect size which is z = 2.00
In our example, we are comparing a sample mean from students completing a GRE preparation course to the overall population mean of GRE scores
So, the sample of students taking the GRE preparation course performed two standard deviations above the population mean
Completion of Hypothesis Test
The steps for conducting a hypothesis test produce a test statistic (z value), a corresponding probability value, and an effect size
We obtain the probability value by using the calculated z (or t statistic) to obtain the probability value from the respective statistical table
We report each of these values as a way of scientifically communicating our results
We will use these same steps for subsequent inferential tests
Significance vs. Meaningful Difference
Although we know that the obtained sample mean is unlikely to occur in a population whose mean is 500, our second question is perhaps even more important
Is the obtained sample mean for the students taking the GRE preparation course meaningfully different from the population mean GRE scores?
In order to answer this question, we need to calculate an effect size
The z value is already a standardised difference between means, so it provides us with a direct measure of effect size
Hence, our z score of 2.0, located at approximately the 98th percentile, suggests that as a group the students taking the GRE preparation course outscored about 98% of the general population
That is, if we studied groups of 25 students who did not take a GRE preparation course, most of the time the mean for a group would be around 500
The large effect size suggests that the difference is meaningful; the effect is also meaningful in the sense that the benefit of the change in the GRE score justifies the cost of the GRE preparation course
Single Sample t-test
We can only use the z-test if we know the population mean and standard deviation
However, usually we do not know the parameters (i.e., population mean and standard deviation) associated with a characteristic that we are interested in examining
E. g., perhaps we are interested in examining opinions about a political candidate
We can use a Likert scale ranging from 1 (Disapproval) to 5 (Approval) to obtain a rating of the candidate
In this case, because we don’t know the population mean and standard deviation; we don’t know the (real) average approval rating of the candidate
Instead, we define the population mean (average approval rating) as equal to 3 because this is a neutral value indicating ambivalence
If we collect data from a group of individuals, we can generate a sample mean approval rating and compare this to the defined neutral rating or population mean
We can also calculate an estimate of the standard deviation
We can also conduct an inferential test to determine if the approval rating is significantly different from a neutral value
We specify the population mean of 3 (above) as the midpoint in the Likert scale, as we want to compare the average reported approval rating to the specified neutral value or population mean for the Likert scale
t-distribution
This opinion questionnaire is not a standardised measure, so we cannot assume the distribution is normal and also do not know the standard deviation
Because we cannot assume that the data are normally distributed, we must use a different sampling distribution to make our comparison
A t-distribution is one such sampling distribution that is quite similar to the normal distribution
We can use the t-distribution with small samples whose scores resemble normal distributions, but that aren’t quite normal
The distinction between the error term that we use in a z-test and the error term that we use to calculate the t-test is very important
When we calculate error in the z-test, we make this calculation based on a known population standard deviation
However, when calculating the t-test, we do not know the population standard deviation
Instead, we must use an estimated standard deviation for our calculation of the t-test
Because our standard deviation is estimated, it is probably not as accurate; in other words, our estimated standard deviation introduces an element of inaccuracy
So, rather than using the known population standard deviation to calculate the SEM, we use an estimated standard deviation (calculated with an adjusted sample size, N − 1) to calculate an estimated standard error of the mean
In other words, not σM = σ / √N, but sM = s / √N
t-test
Hence, when we estimate a sample SEM (sM), we use the N – 1 (called degrees of freedom) calculation to obtain the estimated standard deviation as illustrated below:
Instead of s = √(Σ(X − M)² / N), it is s = √(Σ(X − M)² / (N − 1))
So, the degrees of freedom are taken into account when we calculate the standard deviation before we calculate the standard error of the mean
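Python's standard library exposes both versions of the standard deviation, which makes the N versus N − 1 distinction easy to see (the data below are illustrative, not from the lecture):

```python
# Population vs sample standard deviation: statistics.pstdev divides
# by N, while statistics.stdev divides by N - 1 (degrees of freedom).
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data, not from the text

pop_sd = statistics.pstdev(scores)  # divides by N      -> 2.0
samp_sd = statistics.stdev(scores)  # divides by N - 1  -> ~2.14
print(pop_sd, samp_sd)
```

The sample estimate is always a little larger than the population formula, compensating for the fact that a sample tends to underestimate population variability.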
When we use a t-test, we rely upon assumptions of the t-distribution to derive our estimated probability
The t-distribution, or more accurately the family of t-distributions, is based upon the size of the sample
So, if we obtain a very large sample, the t-distribution approximates the normal distribution
Using the t-formula, allows us to compare a single sample to a population for which we do not know the standard deviation
That is, t = (M − μ) / sM
In our example of voter opinions, we can use the single sample t-test to examine approval ratings for a political candidate
We compare the neutral value (M = 3 from the Likert scale) to a sample mean (M = 4 obtained from our voters)
Thus, we compare the population mean of 3 to the sample mean of 4 to ascertain if the voters express a statistically significant variation in approval rating
Calculation of the t-test
In order to complete the t-test calculation, we must begin by calculating basic descriptive statistics; the mean and the standard deviation
Step 1: using M = 4 for a set of scores where N = 10, we may now calculate the standard deviation for this sample of scores: s = √(Σ(X − M)² / (N − 1))
s = √(10 / 9) = √1.11 = 1.05
Step 2: in our second step we use the standard deviation value to calculate the estimated standard error of the mean
sM = s / √N = 1.05 / √10 = 1.05 / 3.16 = 0.33
Step 3: in our third step, we complete calculation of the single sample t-test using our estimated standard error of the mean and the differences between means, which is
t = (M − μ) / sM = (4 − 3) / 0.33 = 1 / 0.33 = 3.03
In order to determine if this calculated value is significant, we need to compare this value to another table, the t-Distribution Table (Table E.2)
We use degrees of freedom (df) = 9 to find the tabled t-value because N – 1 = 9 in this example
Degrees of freedom (df) are the number of values that are free to vary when using an inferential statistic
If we use an alpha level of .05 and our degrees of freedom (df = 9), we find that the tabled value is 2.262
This means that our calculated t-test value needs to be larger than 2.262 to be considered statistically significant
Our calculated value of 3.03 is larger than 2.262, so we conclude that the results are statistically significant
Thus, we have enough evidence to conclude that our group of 10 people departs from a neutral rating of the candidate
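The t-test above can be reproduced from the summary statistics alone (a sketch; the tabled critical value of 2.262 is taken from a t-distribution table as in the text, and the slight difference from 3.03 arises because the text rounds the SEM to 0.33 before dividing):

```python
# Single sample t-test from the summary statistics in the text:
# M = 4, s = 1.05, N = 10, tested against the neutral value mu = 3.
import math

M, s, N, mu = 4, 1.05, 10, 3

sem = s / math.sqrt(N)   # estimated standard error of the mean
t = (M - mu) / sem

# Critical value for df = 9, two-tailed alpha = .05, from a t table
t_crit = 2.262

print(round(t, 2))   # 3.01 (the text reports 3.03 after rounding the SEM)
print(t > t_crit)    # True -> statistically significant
```

Because the calculated t exceeds the tabled critical value, we reach the same conclusion as the text: the approval ratings depart significantly from neutral.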
Results of t-test
The following paragraph conveys information that allows us to understand results
Ten people responded to an approval rating of a political candidate. We obtained equal proportions of men (n = 5) and women (n = 5). Respondents rated the candidate on a Likert scale ranging from 1 (Disapproval) to 5 (Approval). A rating of 3 indicated a neutral opinion. We compared the average rating to the neutral value in order to determine if our participants differed significantly from neutral. Respondents generally expressed approval (M = 4, SD = 1.05) of the candidate. The average approval rating was significantly different from the neutral value, t(9) = 3.03, p < .05.
In this example we used the single sample t-test to examine whether an opinion was different from a neutral value
Learning Outcomes for Lecture Unit 6 – Inferential Statistics
Upon completion of this lecture unit you should be able to:
Explain the meaning of inferential statistics.
Explain the meaning of statistical significance.
Describe what is meant by a Type I error and a Type II error.
Athletes are often tested for use of performance enhancing substances. If we consider this testing in the context of hypothesis testing, given what you know about possible outcomes of these tests, consider which error would be most desirable. Provide a rationale for why you would opt for a Type I or Type II error.
Explain the meaning of statistical power.
Describe the relationship between the standard deviation and the normal distribution.
Explain how the standard error of the mean compares to the standard deviation.
Describe the relationship between the standard error of the mean and sample size.
Differentiate the z-score and the z-test.
Explain how the z-score is related to the normal distribution.
Define the standard error of the mean (SEM).
Outline the steps used in hypothesis testing.
Explain the meaning of effect size.
Describe the t-test in your own words.