Study Notes on Sampling Distributions and Standard Error

Sampling Distributions and Standard Error

Content is based on the textbook "Statistics, Biology, and R" by Arndt F. Laemmerzahl.

Objective: Estimate average reaction time for adults at a concussion clinic.
Population: All eligible adults seen at the concussion clinic during the study period under the same reaction-time testing protocol. This is the group from which we intend to draw conclusions.
Sample: A randomly selected group of 25 adults tested under the same protocol.
Sample Size (n): 25
Example Measurement: Average reaction time from the sample is 362 ms (milliseconds).
Distribution Characteristics: The sample distribution of individual reaction times is right-skewed (the right tail is longer). A few slower reaction times pull the distribution to the right.

Second Sample (Sample B): Another random sample of 25 adults, average reaction time is 315 ms.
Distribution Characteristics of Sample B:
- Distribution appears closer to normal.
- Possible explanations:
- Random sampling variation: Each sample consists of different individuals; thus, the shapes and means can differ even if drawn from the same population.
- Skew and Outliers: In cases of right-skewed data, certain very slow responses can artificially inflate the average.
- Sample B has a more balanced mix of faster and slower times around its mean, leading to a more symmetric histogram.

Definitions:
- Sample Distribution: Refers to the distribution of individual observations within one sample (in this case, the 25 reaction times).
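- Sampling Distribution: Refers to the distribution of a statistic (here, the sample mean) across many repeated random samples of the same size, rather than the distribution of individual observations.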
Constructing Sampling Distribution:
- By repeatedly taking random samples of size 25 and calculating their means, you can create a histogram of sample means.
Purpose of Focusing on Sample Means:
- Estimate population average reaction time (µ) directly using the sample mean (𝑦̅).
- Sample mean provides a summary that smooths variability from individual data points.
Sampling Distribution Behavior:
- Initially, with 12 samples, the distribution of sample means did not appear normal due to insufficient samples.
- Increasing the number of samples to 120 starts to show a bell-shaped distribution.
- Further increase to 15,000 samples continues to shape the distribution more clearly.
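The repeated-sampling procedure described above can be sketched in a few lines of Python. This is only an illustration: the right-skewed population (a 250 ms baseline plus an exponential delay averaging 100 ms, so a population mean of 350 ms) is an assumption invented for the demo, not the clinic's data, and `simulate_sample_means` is a hypothetical helper name.

```python
import random
import statistics

def simulate_sample_means(num_samples, n=25, seed=0):
    """Draw num_samples random samples of size n and return their means."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_samples):
        # Assumed right-skewed population: 250 ms baseline plus an
        # exponential delay with mean 100 ms (population mean = 350 ms).
        sample = [250 + rng.expovariate(1 / 100) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means

# Same sample counts as in the notes: 12, 120, and 15,000 samples of size 25.
for k in (12, 120, 15_000):
    means = simulate_sample_means(k)
    center = statistics.fmean(means)   # center of the sampling distribution
    spread = statistics.stdev(means)   # spread of the sample means
    print(f"{k:>6} samples: mean of means = {center:.1f}, SD of means = {spread:.1f}")
```

Plotting a histogram of `means` for each value of `k` would show the shape filling in as more samples are collected, while the center stays near the assumed population mean.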

Key Takeaways:
- A single sample can differ markedly from the population in shape, center, and outliers because of random sampling variation, so its mean may sit well away from µ.
- Collecting more samples stabilizes the distribution of sample means, centering it close to µ and revealing the underlying distribution without altering it.
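- This behavior is described by the Central Limit Theorem: for large enough n, the distribution of sample means is approximately normal even when the individual observations are skewed.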

Definition of Standard Error (SE): A measure of how precisely a sample mean estimates the true population mean.
Formula for Standard Error:
$SE = \frac{s}{\sqrt{n}}$, where s is the sample standard deviation and n is the sample size.
Distinguishing Standard Deviation from Standard Error:
- Standard Deviation (SD): Represents the spread of individual data points within a sample relative to the sample mean.
- Standard Error (SE): Reflects the variation of sample means across multiple samples relative to the population mean.
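Estimation: The standard deviation of the sampling distribution of the mean is $\sigma / \sqrt{n}$. Since the true values of µ and σ are usually unknown, we substitute the sample estimate s for σ, giving $SE = \frac{s}{\sqrt{n}}$. The same substitution appears in the standardized distance of the sample mean from µ: $\frac{\bar{Y} - \mu}{s / \sqrt{n}}$.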
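To make the SD versus SE distinction concrete, here is a minimal sketch that computes both from one sample. The reaction times are invented for illustration, and `standard_error` is a hypothetical helper, not a function from the textbook.

```python
import math
import statistics

def standard_error(sample):
    """Estimated standard error of the mean: s / sqrt(n)."""
    s = statistics.stdev(sample)        # sample SD (n - 1 in the denominator)
    return s / math.sqrt(len(sample))

# Hypothetical reaction times in ms (made up for this example).
times = [310, 325, 298, 402, 350, 333, 365, 341, 520, 376]

print("mean:", statistics.fmean(times))
print("SD  :", statistics.stdev(times))   # spread of individual observations
print("SE  :", standard_error(times))     # uncertainty of the sample mean
```

The SD describes how far individual times sit from the sample mean, while the SE is smaller by a factor of the square root of n because it describes the mean itself.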

Interpretation of SE:
- A small SE indicates a reliable estimate of µ by the sample mean.
- A large SE suggests poor estimation of µ.
Limitations of SE:
- SE does not provide information on the spread of individual data points in the sample (which is captured by standard deviation).

Effect of Sample Size on 𝑦̅:
- Larger sample sizes lead to better representation of the population, resulting in means closer to the true population mean.
- Conversely, means of smaller samples fluctuate more from sample to sample, so they tend to fall further from the true µ.
- As n approaches infinity, the sample mean converges to the population mean: $\bar{Y} \to \mu$.
Effect on σ:
- σ is a fixed property of the population, so it does not change with sample size. A larger sample does, however, give a better estimate of σ (through s).
Effect on s (Sample SD):
- The sample SD s varies from sample to sample; unlike SE, it does not shrink systematically as n grows.
- Larger samples yield a sample SD that more closely represents the population SD.
- Conversely, the SD of a small sample varies more from sample to sample, so it is a less dependable estimate of the population SD.
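To see how s behaves at different sample sizes, the sketch below repeatedly draws samples and summarizes the resulting sample SDs. The normal population with mean 150 and SD 30 is an assumption chosen for illustration, and `sample_sd_summary` is a hypothetical helper name.

```python
import random
import statistics

def sample_sd_summary(n, trials=2000, mu=150.0, sigma=30.0, seed=1):
    """Return (mean, spread) of the sample SD over repeated samples of size n."""
    rng = random.Random(seed)
    sds = []
    for _ in range(trials):
        # Assumed normal population; only the sample size n is varied.
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        sds.append(statistics.stdev(sample))
    return statistics.fmean(sds), statistics.stdev(sds)

for n in (5, 25, 100):
    center, spread = sample_sd_summary(n)
    print(f"n={n:>3}  average s = {center:5.1f}  spread of s = {spread:4.1f}")
```

Comparing the "spread of s" column across sample sizes shows how much the sample SD scatters around the population value at each n.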
Effect on SE Based on Sample Size:
- As the sample size increases, $SE = \frac{s}{\sqrt{n}}$ decreases, leading to a more reliable estimate of the population mean. Because of the square root, quadrupling n only halves SE.
Specific Example:
- Calculating SE for n = 50 with s = 4.67: SE = 4.67 / √50 ≈ 0.6604.
- For n = 100, SE = 4.67 / √100 = 0.467.
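The two results in this example are consistent with dividing s = 4.67 by the square root of n, which a short check confirms. The helper name `se_from_sd` is hypothetical.

```python
import math

def se_from_sd(s, n):
    """Standard error of the mean from a sample SD s and sample size n."""
    return s / math.sqrt(n)

for n in (50, 100):
    print(f"n = {n:>3}: SE = {se_from_sd(4.67, n):.4f}")
```

Doubling n from 50 to 100 reduces SE by a factor of √2, not by half.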

Gene Expression Study:
- True average (µ) = 150 units; σ = 30 units.
- Impact of increasing sample size from 10 to 100: SE decreases.
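As a worked check of the gene expression example, and assuming the standard error is computed from the stated σ as σ/√n, the two sample sizes give:

```latex
\[
SE_{n=10} = \frac{30}{\sqrt{10}} \approx 9.49,
\qquad
SE_{n=100} = \frac{30}{\sqrt{100}} = 3.00
\]
```

A tenfold increase in n therefore shrinks the standard error by a factor of √10 ≈ 3.16.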
Fish Species Length Study:
- Initial sample of 20 fish vs. 200 fish: the sample mean fluctuates from sample to sample, and fluctuates less at the larger sample size.
Deer Body Weight Study:
- Increasing sample size from 50 to 1,000 won't change σ, which is fixed.
Cholesterol Level Study:
- Population mean (µ) remains unchanged even if sample size is increased or decreased.
Height of Trees Study:
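- Probability questions about a sample mean are answered by converting it to a z-score, $z = \frac{\bar{y} - \mu}{\sigma / \sqrt{n}}$, and reading the probability from the standard normal distribution.
  - Example: Finding the probability that the sample mean is less than 70 meters from its z-score.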

mRNA Levels Calculation:
- Calculate probabilities for the mean of a sample of n = 40 observations, given true mean (μ=200) and standard deviation (σ=25) of mRNA levels:
  a) Pr{ 𝑌̅ > 210 } = 0.0057
  b) Pr{ 𝑌̅ < 187 } = 0.0005
  c) Pr{ 192 < 𝑌̅ < 207 } = 0.9399
  d) Pr{ 𝑌̅ > 207 OR 𝑌̅ < 192 } = 0.0601
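The mRNA probabilities can be checked with Python's standard library, assuming the sample mean is normally distributed with mean μ = 200 and standard error σ/√n = 25/√40. The variable names are illustrative. Exact CDF values can differ from z-table answers in the third or fourth decimal place, because tables round z to two decimals.

```python
import math
from statistics import NormalDist

mu, sigma, n = 200, 25, 40
se = sigma / math.sqrt(n)          # standard error of the sample mean
ybar = NormalDist(mu, se)          # assumed sampling distribution of the mean

p_a = 1 - ybar.cdf(210)                # Pr{ Ybar > 210 }
p_b = ybar.cdf(187)                    # Pr{ Ybar < 187 }
p_c = ybar.cdf(207) - ybar.cdf(192)    # Pr{ 192 < Ybar < 207 }
p_d = 1 - p_c                          # Pr{ Ybar > 207 or Ybar < 192 }

for label, p in zip("abcd", (p_a, p_b, p_c, p_d)):
    print(f"{label}) {p:.4f}")
```

Part d) is the complement of part c), which is why the two probabilities sum to 1.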