Applied Business Statistics: Hypothesis Testing - Single Sample t-Test

Single Sample t-Test Fundamentals

Type of Data:- One variable, which must be either quantitative or dichotomous (coded as $0/1$ ). A dichotomous variable will yield a proportion.
Purpose/Use:- To compare a sample mean to some established standard or assumed population value.
- Example: Comparing a survey response mean to a neutral value of $3$ on a $1-5$ scale.
Equation (t-statistic):- $t = rac{\bar{x} - \mu}{\sigma_{\bar{x}}}$
- This equation is identical to that for a z-score of a mean within a sampling distribution. It standardizes the sample mean.
- $\bar{x}$ : Sample mean
- $\mu$ : Population mean (the standard or assumed value)
- $\sigma_{\bar{x}}$ : Standard error (the standard deviation of the sampling distribution).
**Practical Tools & Alternatives:- Excel can be used, but requires a "trick" by utilizing the paired t-test function for a single-sample scenario. A dedicated Excel video in the library demonstrates this.
- Confidence Interval Alternative: Instead of a t-test, a confidence interval of the sample mean can be used. If the hypothesized population mean falls outside of this confidence interval, then the null hypothesis can be rejected.

Cereal Filling Machine Example: Introduction

Scenario: A cereal filling machine's specifications indicate an average fill of $32 \text{ oz}$ . The question is whether the machine needs recalibration.
Observation: A sample is taken, yielding an average fill of $32.2 \text{ oz}$ .
Initial Question: Is this sample mean $(32.2 \text{ oz})$ "unusual" when compared to the specified $32.0 \text{ oz}$ ?- Caution against Intuition: While a $0.2 \text{ oz}$ difference might seem insignificant when thinking about individual cereal boxes, we are comparing the mean of a sample (which could be from $10$ , $50$ , $100$ , or $1000$ boxes). The sampling distribution derived from sample means has a much smaller variance than the original population of individual boxes. Therefore, formal statistical tools are necessary to determine if the $0.2 \text{ oz}$ difference is a real deviation or merely sampling error.
**Hypotheses Formulation (Always about the Population Mean, $\mu$ ):- Null Hypothesis ( $H_0$ ): Based on the initial or historically assumed population mean ( $\mu$ ). It posits that the current mean is equal to the historic mean. - In this example: $\mu = 32.0 \text{ oz}$ (The machine is filling correctly according to specs).
```

```
- Alternative Hypothesis ( $H_a$ ): The claim we are trying to substantiate, which is the opposite of the null hypothesis. It typically represents a deviation from the historic mean.- In this example: $\mu \neq 32.0 \text{ oz}$ (The machine is not filling correctly, meaning it's either over- or under-filling).

Inferring About a Single Population Mean: The Process

Objective: To determine if the cereal filling machine is operating correctly in terms of its average fill.
**Steps for Inference:1. Collect a Sample: Gather $n$ cereal boxes and accurately measure the ounces of cereal in each.
1. Compute Sample Statistics: Calculate the sample mean ( $\bar{x}$ ) and sample standard deviation ( $s$ ) from the collected measurements.
2. Address Sampling Error: A sample mean will almost never be exactly the population mean (e.g., $32 \text{ oz}$ ) even if the machine is working perfectly, due to inherent sampling error. The key is to distinguish between a "real" or "significant" difference and one solely attributable to sampling error.
3. Utilize the Sampling Distribution: To determine "how likely" our obtained sample mean is, we invoke the concept of the sampling distribution.- The t-test specifically examines the sampling distribution assuming the null hypothesis ( $\mu = 32$ ) is true.
  - This distribution represents all possible sample means that could be obtained if the true population mean were $32 \text{ oz}$ .
4. Calculate Probability: Using this sampling distribution, we calculate the probability of observing a sample mean as extreme as, or more extreme than, our actual sample mean ( $\bar{x} = 32.2 \text{ oz}$ ), assuming the null hypothesis holds.
5. Decision Making: We then ask: Is this calculated probability sufficiently small to conclude that it is unlikely the null hypothesis is true? If so, we reject $H_0$ .

Characteristics of the Sampling Distribution (Assuming $H_0$ is True)

Purpose: This distribution quantifies the expected sampling error and helps identify sample means that are "usual" versus "unusual" under the assumption that the null hypothesis ( $H_0$ ) is true.
Comparison: Our obtained sample mean ( $\bar{x}$ ) is compared against this hypothetical sampling distribution.
**Key Characteristics:1. Mean of the Sampling Distribution ( $\mu_{\bar{x}}$ ):- Is equal to the mean of the population ( $\mu$ ).
- Under the assumption that $H0$ is true, $\mu{\bar{x}} = \mu = 32 \text{ oz}$ .
```

```
1. Standard Error ( $\sigma_{\bar{x}}$ ):- This is the standard deviation of the sampling distribution.
  - Formula: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
  - Practical Estimation: Since the true population standard deviation ( $\sigma$ ) is rarely known, it is estimated using the sample standard deviation ( $s$ ) from our data, especially when the sample size is large enough. Thus, $\sigma_{\bar{x}} \approx \frac{s}{\sqrt{n}}$ . This demonstrates that larger sample sizes ( $n$ ) lead to smaller standard errors, implying that sample means are expected to be closer to the population mean.
2. **Shape of the Sampling Distribution:- For $n \geq 30$ (Central Limit Theorem): If the variable is continuous and the sample size is $30$ or greater, the sampling distribution will be approximately normal (and becomes more normal as $n$ increases).
  - For n < 30 (Small Samples): An additional assumption is required: the original population distribution must be normal. If this assumption holds, the sampling distribution will also be normal.
  - Simplification with t-distribution: For practical purposes, the t-distribution can always be used. It inherently accounts for the increased variability in smaller samples, and for larger sample sizes, it asymptotically converges to the normal (z) distribution, becoming nearly identical. This eliminates the need to decide between z and t-distributions based on sample size.

Performing the t-Test: Cereal Machine Example

**Sample Data After Collection (Assume $n = 100$ boxes):- Sample Mean: $\bar{x} = 32.16 \text{ oz}$
- Sample Standard Deviation: $s = 0.70 \text{ oz}$
- Sample Size: $n = 100$
Calculate the Standard Error ( $\sigma{\bar{x}}$ ):***- $\sigma{\bar{x}} \approx \frac{s}{\sqrt{n}} = \frac{0.70}{\sqrt{100}} = \frac{0.70}{10} = 0.07 \text{ oz}$
Determine Sampling Distribution Shape: Since $n = 100$ (which is $\geq 30$ ), the sampling distribution is considered normal. We will use the t-distribution for calculating probabilities, as it is appropriate for all sample sizes.
The Critical Question Rephrased for Decision-Making: Is the observed sample mean ( $\bar{x} = 32.16 \text{ oz}$ ) different from the hypothesized population mean ( $\mu = 32 \text{ oz}$ ) merely due to random sampling error (meaning the equipment is working correctly), or is it a "significant" difference indicating the equipment is not working as intended and requires recalibration?- "Significant" here means the difference is substantial enough that it suggests the true population mean is no longer $32 \text{ oz}$ .

Defining Hypotheses and Significance Level

**Formal Hypotheses for the Cereal Machine:- Null Hypothesis ( $H_0$ ): $\mu = 32 \text{ oz}$ (The machine is filling neutrally/correctly).
- Alternative Hypothesis ( $H_a$ ): $\mu \neq 32 \text{ oz}$ (The machine is either over- or under-filling; this is a two-tailed hypothesis).
Question of Likelihood: Is it likely that our sample mean of $32.16 \text{ oz}$ comes from a population where the true mean is $32 \text{ oz}$ ?
Again, resist using intuition based on small numerical differences; the comparison is between sample means, not individual values.
Defining "Unlikely" (Significance Level, $\alpha$ ):- We establish a threshold probability, $\alpha$ , to determine what constitutes an "unlikely" event.
- By convention, $\alpha$ is typically set at $0.05$ (or $5\%$ ).
- Decision Rule: If the probability of obtaining our sample mean (or a more extreme one) is less than $\alpha$ , we will reject the null hypothesis ( $H0$ ) in favor of the alternative hypothesis ( $Ha$ ).
```

```
- Graphical Representation of Rejection Regions:- The sampling distribution is centered on the hypothesized population mean ( $\mu = 32 \text{ oz}$ ), or $0$ for standardized t-scores.
- Since $H_a$ is two-sided ( $\mu \neq 32$ ), the total "unlikely" probability ( $\alpha$ ) is split equally into two rejection regions in the tails of the distribution.
- Each tail represents $\alpha/2 = 0.05 / 2 = 0.025$ (or $2.5\%$ ) of the distribution.
- If our calculated t-value falls into the upper tail, we reject $H0$ and conclude overfilling. If it falls into the lower tail, we reject $H0$ and conclude underfilling.
Understanding Type I Error:- It is possible, though unlikely (with a probability equal to $\alpha$ ), that our sample mean falls into a rejection region even when the null hypothesis ( $\mu = 32$ ) is actually true. If we reject $H_0$ in this scenario, we commit a Type I error (incorrectly rejecting a true null hypothesis).
**Calculating the t-value and Making a Decision:- Using the formula: $t = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} = \frac{32.16 - 32}{0.07} = \frac{0.16}{0.07} \approx 2.2857$
- The lecture states the calculated t-value is $2.33$ .
- Decision: A t-value of $2.33$ falls into the right-hand side rejection area (the upper $2.5\%$ tail). This signifies that the probability of obtaining a sample mean of $32.16 \text{ oz}$ (or more extreme), given a sample size of $100$ and a true population mean of $32 \text{ oz}$ , is less than $5\%$ (our $\alpha$ ).
- Conclusion: Despite the small apparent difference, the large sample size makes the deviation significant. We reject the null hypothesis that the machine is working properly and conclude it is likely overfilling, requiring recalibration.

Excel Output and Key Statistics

Excel Single-Sample t-Test via Paired t-Test Trick: Excel's direct single-sample t-test is not readily available, so the "Paired Two Sample for Means" function is adapted. The output title will reflect this, but it is effectively performing a single-sample test.
**Dissecting Excel Output:- Sample Mean (e.g., "Fill"): Shows the calculated mean from your sample (e.g., $32.163$ ).
- Hypothesized Mean Diff (e.g., "Hypo=32"): Represents the population mean assumed under the null hypothesis (e.g., $32$ ).
- Observations ( $n$ ): The sample size (e.g., $100$ ).
- Degrees of Freedom ( $df$ ): Calculated as $n-1$ . For $n=100$ , $df=99$ . While the exact meaning is complex, remembering it as $n-1$ is sufficient for practical application.
- t Stat (t-value): The calculated t-statistic (e.g., $2.32623$ ).- This value standardizes the sample mean, indicating how many standard errors it is away from the hypothesized population mean ( $\mu = 32$ ). For our example, $2.33$ means the sample mean is more than two standard deviations above the hypothesized mean under the null hypothesis.
- P(T<=t) two-tail (p-value): The probability value for a two-tailed test (e.g., $0.022079$ ).- This is a crucial value for decision-making.
- t Critical two-tail: The critical t-value from the t-distribution table for the specified $\alpha$ level and degrees of freedom (e.g., $1.984217$ ).- While provided, the importance of using t-critical values is diminished when reporting p-values, which directly give the probability.

The Meaning and Use of the p-value

Definition of p-value:- The p-value is the probability of obtaining a sample mean as extreme as, or more extreme than, the observed sample mean ( $\bar{x} = 32.16$ ), solely by random sampling error, assuming that the null hypothesis ( $\mu = 32$ ) is true.
- In the sampling distribution graph, this corresponds to the sum of the areas in both shaded tails (for a two-tailed test).
Decision Rule Using p-value:- If p-value < \alpha : The observed sample mean is considered "unusual." In this case, we **reject the null hypothesis** ( $H0$ ).
- Conclusion: The mean of the machine is significantly different from $\mu = 32$ . - Example: Our p-value of $0.022$ is less than the typical $\alpha = 0.05$ . Therefore, we reject $H0$ . The difference is statistically significant.
```

```
- If p-value $\geq \alpha$ : We **fail to reject the null hypothesis** ( $H_0$ ).- Conclusion: There is not enough evidence to claim a significant difference.
Context of a Two-Tailed Test:- Our alternative hypothesis, $H_a: \mu \neq 32$ , is non-directional (it states the mean is different, not specifically higher or lower).
- Therefore, the rejection regions (and the p-value from the Excel output, "P(T<=t) two-tail") are calculated symmetrically across both tails of the sampling distribution.
- For directional (one-tailed) hypotheses, a different p-value calculation (one-tail) would be used. The lecture graphic illustrates the two blue shaded tails representing the two-tailed p-value.

One-Tailed vs. Two-Tailed Hypotheses

Two-Tailed Test (Non-Directional):
- Used when the alternative hypothesis ( $H_a$ ) states that the population mean is simply different from the hypothesized value (e.g., $\mu \neq 32 \text{ oz}$ ).
- Rejection regions are split into two tails of the sampling distribution, each containing $\alpha/2$ probability.
- Provides evidence for deviation in either direction (greater or smaller).
One-Tailed Test (Directional):
- Used when the alternative hypothesis ( $H_a$ ) specifies a direction for the difference.
- Upper-Tailed Test: Ha: \mu > \mu0 (e.g., \mu > 32 \text{ oz} for overfilling). The entire $\alpha$ probability is placed in the upper tail.
- Lower-Tailed Test: Ha: \mu < \mu0 (e.g., \mu < 32 \text{ oz} for underfilling). The entire $\alpha$ probability is placed in the lower tail.
- P-value for One-Tailed Tests: If using Excel's "P(T<=t) two-tail", for a one-tailed test, you would typically divide the two-tailed p-value by 2, if the t-statistic falls in the hypothesized direction. Otherwise, if the t-statistic falls in the opposite direction, the p-value would be > 0.5 .

Assumptions of the Single Sample t-Test

For the conclusions from a t-test to be valid, certain assumptions about the data must be met:

Random Sampling: The sample must be drawn randomly from the population. This ensures the sample is representative and that generalizations to the population are valid.
Independence of Observations: Each observation (data point) in the sample must be independent of every other observation. This means the measurement for one cereal box does not influence the measurement for another.
Measurement Scale: The dependent variable (the variable being measured, e.g., cereal fill in ounces) must be measured on an interval or ratio scale.
Normality of the Population Distribution:
- If n < 30 (Small Sample): The population from which the sample is drawn should be approximately normally distributed. If this assumption is severely violated, the t-test results may not be reliable.
- If $n \geq 30$ (Large Sample - Central Limit Theorem): Due to the Central Limit Theorem, the assumption of population normality becomes less critical, as the sampling distribution of the mean will tend towards normality regardless of the population's shape. This is why the t-distribution is robust to violations of normality with large sample sizes.
- Practical Note: The t-test is generally robust to moderate violations of normality, especially with larger sample sizes. However, extreme skewness or outliers can still affect results.

Reporting Single Sample t-Test Results

When reporting the results of a single-sample t-test, it's important to include key information:

The Hypotheses: State the null ( $H0$ ) and alternative ( $Ha$ ) hypotheses clearly.
Sample Statistics: Report the sample mean ( $\bar{x}$ ), sample standard deviation ( $s$ ), and sample size ( $n$ ).
Test Statistic: Report the calculated t-value, degrees of freedom ( $df$ ), and the p-value.
- Example: "A single-sample t-test revealed a significant difference in cereal fill from the hypothesized mean; $t(99) = 2.33$ , $p = 0.022$ .
Significance Level: State the chosen significance level ( $\alpha$ ).
Conclusion: Based on the p-value and $\alpha$ , state whether the null hypothesis was rejected or failed to be rejected, and interpret this finding in the context of the research question.
- Example for cereal machine: "Based on the t-test ( $t(99) = 2.33$ , $p = 0.022$ ), with $\alpha = 0.05$ , we reject the null hypothesis. The cereal filling machine is likely overfilling and requires recalibration, as the sample mean ( $32.16 \text{ oz}$ ) was significantly higher than the specified $32.0 \text{ oz}$ .

Statistical Power and Type II Errors

Type I Error (Alpha, $\alpha$ ):
- Already discussed: The probability of incorrectly rejecting a true null hypothesis. Set by the researcher (e.g., $\alpha = 0.05$ ).
Type II Error (Beta, $\beta$ ):
- The probability of incorrectly *failing* to reject a false null hypothesis. This means we miss a real effect or difference that exists in the population.
- Example: Concluding the cereal machine is filling correctly when in reality it *is* overfilling (or underfilling).
Statistical Power ( $1 - \beta$ ):
- The probability of correctly rejecting a false null hypothesis. Essentially, it's the probability of finding a statistically significant result when there truly is an effect to be found.
- A higher power is desirable, typically aimed for $0.80$ (or $80\%$ ).
- Factors influencing power:
  - Sample Size ( $n$ ): Larger samples generally lead to higher power (due to smaller standard error).
  - Effect Size: Larger true effects in the population are easier to detect, thus increasing power.
  - Alpha Level ( $\alpha$ ): Increasing $\alpha$ increases power (but also increases Type I error risk).
  - Variability: Less variability (smaller standard deviation) in the population increases power.

Effect Size

Beyond Statistical Significance:
- A statistically significant result (small p-value) indicates that an observed difference is unlikely due to random chance. However, it does *not* tell us about the practical importance or magnitude of that difference.
- Example: A $0.1 \text{ oz}$ difference in cereal fill might be statistically significant with a very large sample size ( $n=10000$ ), but it might not be practically meaningful for the consumer or the company.
Definition:
- Effect size measures the *magnitude* or *strength* of a phenomenon. For t-tests, it quantifies the difference between the sample mean and the hypothesized population mean in a standardized way.
Cohen's d (Common Effect Size for t-tests):
- Formula: $d = \frac{\bar{x} - \mu}{s}$ (or estimated using pooled standard deviation for other t-tests).
- $\bar{x}$ : Sample mean
- $\mu$ : Hypothesized population mean
- $s$ : Sample standard deviation (or population standard deviation if known)
- Interpretation of Cohen's d (general guidelines, context dependent):
  - $d \approx 0.2$ : Small effect
  - $d \approx 0.5$ : Medium effect
  - $d \approx 0.8$ : Large effect
Cereal Machine Example - Calculating Cohen's d:
- Using the values: $\bar{x} = 32.16 \text{ oz}$ , $\mu = 32.0 \text{ oz}$ , $s = 0.70 \text{ oz}$
- $d = \frac{32.16 - 32.0}{0.70} = \frac{0.16}{0.70} \approx 0.228$
- Interpretation: This is a small effect size. While the t-test found a statistically significant difference (p = 0.022), the practical difference of $0.16 \text{ oz}$ is relatively small compared to the overall variability ( $0.70 \text{ oz}$ ) in fills. This might prompt further investigation into the cost-benefit of immediate recalibration versus the cost of a small overfill.

Conclusion and Practical Implications

After performing a single-sample t-test, the decision-making process involves a synthesis of statistical significance (p-value, Type I error) and practical significance (effect size, Type II error, and power).

Statistical Significance (p-value): If the p-value is less than the chosen alpha level ( $\alpha$ ), we reject the null hypothesis, concluding that the observed sample mean is statistically different from the hypothesized population mean due to something other than sampling error.
Practical Significance (Effect Size): Beyond statistical significance, the effect size provides a measure of the magnitude of the observed difference. A statistically significant result might not always translate to a practically important effect, especially with very large sample sizes.
Decision-Making Framework:
- If the p-value is small (e.g., < 0.05 ) and the effect size is considered meaningful (e.g., Cohen's d is medium or large, or practically relevant), then there is strong evidence to reject the null hypothesis and take action where appropriate.
- If the p-value is small but the effect size is negligible, the statistical difference may not warrant practical action. For instance, in the cereal example, a $0.16 \text{ oz}$ overfill is statistically significant but might be deemed acceptable given the cost of recalibration, depending on business priorities.
- If the p-value is large (e.g., $\geq 0.05$ ), we fail to reject the null hypothesis. This means there is not sufficient evidence to claim a difference from the hypothesized mean. It's crucial to consider the statistical power in this scenario: a lack of significance could be due to a true lack of effect or insufficient statistical power to detect an existing effect.
Continuous Improvement: Statistical tests like the single-sample t-test are valuable tools for ongoing process monitoring and quality control. Regular sampling and analysis can help detect deviations early, leading to timely adjustments and maintaining product quality or service standards.