Chapter 1-6: Key Vocabulary in Normal Distribution, Sampling, and Inference

Context and class setup

  • Overall tone: today’s class is intentionally less flashy; it aims to build foundational ideas in a dry but important area. If the pace feels slow, trust that future topics will build on this material and feel more engaging.

Core idea: normal distribution and the role of z-scores

  • Normal distribution is central because many statistics rely on normal probabilities for inference.

  • Instead of focusing on the probability of a single value, we work with probabilities related to distributions (e.g., ranges, percentiles).

  • Z-scores provide a standardized way to compare any normal distribution to the standard normal distribution (mean 0, standard deviation 1).

  • Key properties of a z-score:

    • Sign of z indicates whether the raw score is above or below the mean.

    • Magnitude of z indicates how far the score is from the mean in standard deviation units.

  • Z-score formula (for a raw score X from a population with mean μ and standard deviation σ):
    z = \frac{X - \mu}{\sigma}

  • Once you have a z-score, you can determine probabilities from the standard normal distribution, or find the raw score corresponding to a given probability (quantiles).
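
As a quick sketch of that pipeline, assuming Python with scipy in place of a z-table (the numbers below are made up for illustration, not from the lecture):

```python
from scipy.stats import norm

# Illustrative numbers only (not from the lecture):
mu, sigma = 100, 15     # population mean and standard deviation
x = 130                 # a raw score

z = (x - mu) / sigma    # z = (130 - 100) / 15 = 2.0
p = norm.cdf(z)         # P(Z <= 2.0) ≈ 0.9772 on the standard normal

print(f"z = {z:.2f}, P(X <= {x}) = {p:.4f}")
```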

Transforming raw scores to z-scores and back

  • Example setup: measure a variable (e.g., height or annual income) for a group.

  • A raw score example: Tony’s annual income (X). You can convert X to a z-score to compare against the population distribution:

    • If μ is the population mean and σ is the population standard deviation, then z = (X - \mu) / \sigma.

  • Interpretability of z-scores:

    • Positive z → above the mean; negative z → below the mean.

    • The magnitude |z| tells how many standard deviations away from the mean the value is.

  • Practical aspect: z-scores enable you to turn a normal distribution into a common scale for probability calculations and for comparing different variables.
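
A minimal sketch of the two-way conversion; the income figures below are purely hypothetical, since the lecture gives no actual numbers for Tony:

```python
# Hypothetical population figures; the lecture gives no numbers for Tony.
mu, sigma = 50_000, 8_000   # population mean and SD of annual income
x_tony = 62_000             # Tony's annual income (made up for illustration)

# Raw score -> z-score: Tony's relative standing in SD units
z = (x_tony - mu) / sigma   # (62000 - 50000) / 8000 = 1.5

# z-score -> raw score: invert the formula
x_back = mu + z * sigma     # recovers 62000

print(f"z = {z:.2f} (Tony is {z:.2f} SDs above the mean)")
print(f"back-transformed X = {x_back:,.0f}")
```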

Two main directions in probability and inference

  • From value to probability (cumulative): given X (or a z-score), find P(X ≤ x) or P(Z ≤ z).

  • From probability to value (quantiles): given a probability p, find the value x (or z) such that P(X ≤ x) = p.

  • In Excel or similar tools, you can compute both directions (NORM.DIST for cumulative probabilities, NORM.INV for quantiles) and then interpret the results on the actual scale of your variable; see the sketch below.
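
Both directions in one place, as a sketch in Python (scipy’s norm.cdf and norm.ppf play the roles of Excel’s NORM.DIST and NORM.INV; the parameters reuse the time example that appears below):

```python
from scipy.stats import norm

mu, sigma = 4.0, 0.5   # parameters reused from the lecture's time example

# Value -> probability, the analogue of NORM.DIST(x, mu, sigma, TRUE):
p = norm.cdf(4.5, loc=mu, scale=sigma)    # P(X <= 4.5) ≈ 0.8413

# Probability -> value, the analogue of NORM.INV(p, mu, sigma):
x = norm.ppf(0.95, loc=mu, scale=sigma)   # 95th percentile ≈ 4.82 hours

print(f"P(X <= 4.5) = {p:.4f}")
print(f"95th percentile = {x:.2f}")
```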

Worked examples and intuition from the lecture

  • Example 1: business context – assume a population mean μ and standard deviation σ for annual income; find the probability that a randomly selected student earns at most Tony’s income, or at least as much as Tony.

    • If you know Tony’s X, you compute its z-score and then read P(X ≤ Tony’s income) off the normal distribution.

    • If you want the 50th percentile (median) of a normal distribution with mean μ and sd σ, the percentile corresponds to X = μ, since the standard normal median is 0.

    • Example note: with a normal distribution having μ = 4 and σ = 0.5 (the same parameters as the time example below), the 50th percentile is X = 4.

  • Example 2: time-to-complete a task (e.g., programming task or report writing):

    • Given mean time μ = 4 hours and standard deviation σ (e.g., 0.5 hours), compute probabilities like P(X > 4.5).

    • Also compute P(3 ≤ X ≤ 5) by finding P(X ≤ 5) − P(X ≤ 3).

  • Practical takeaway: these kinds of questions help with planning and scheduling in a team setting (e.g., predicting task durations).
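
A sketch of Example 2’s calculations, again assuming Python with scipy rather than a z-table:

```python
from scipy.stats import norm

mu, sigma = 4.0, 0.5   # mean and SD of task time in hours (from the lecture)

# P(X > 4.5): complement of the cumulative probability
p_over = 1 - norm.cdf(4.5, loc=mu, scale=sigma)   # 1 - Φ(1) ≈ 0.1587

# P(3 <= X <= 5): difference of two cumulative probabilities
p_between = (norm.cdf(5, loc=mu, scale=sigma)
             - norm.cdf(3, loc=mu, scale=sigma))  # Φ(2) - Φ(-2) ≈ 0.9545

print(f"P(X > 4.5)     = {p_over:.4f}")
print(f"P(3 <= X <= 5) = {p_between:.4f}")
```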

Rough approximation vs. normal (two methods) for probabilities

  • Two methods introduced for solving these questions:

    • Rough approximation (an intuitive estimate from a quick sketch on paper).

    • Normal method (use z-scores and the standard normal table or software).

  • Process when solving with the normal method:
    1) Draw the normal distribution curve and mark the relevant score(s).
    2) Convert the score(s) to a z-score using z = \frac{X - \mu}{\sigma}.
    3) Look up or compute the probability corresponding to the z-score(s) (or compute the probability between two z-scores for an interval).

  • Practical note: for a bounded interval (e.g., between 3 and 5 hours), you compute P(3 ≤ X ≤ 5) as P(X ≤ 5) − P(X ≤ 3).

  • Excel question from the lecture: Is there a single function to compute P(a ≤ X ≤ b) directly?

    • Not as a single function, but one formula combining two NORM.DIST calls does it: P(a \le X \le b) = \text{NORM.DIST}(b, \mu, \sigma, TRUE) - \text{NORM.DIST}(a, \mu, \sigma, TRUE).

    • Alternative: use NORM.DIST for probabilities and NORM.INV (inverse) to find the value given a probability (quantiles).
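
The same difference-of-cumulatives pattern, wrapped as a small reusable helper (a Python sketch mirroring the Excel formula above):

```python
from scipy.stats import norm

def prob_between(a, b, mu, sigma):
    """P(a <= X <= b) for X ~ N(mu, sigma^2); the Python analogue of
    NORM.DIST(b, mu, sigma, TRUE) - NORM.DIST(a, mu, sigma, TRUE)."""
    return norm.cdf(b, loc=mu, scale=sigma) - norm.cdf(a, loc=mu, scale=sigma)

print(prob_between(3, 5, mu=4, sigma=0.5))   # ≈ 0.9545
```

In Excel itself, the interval probability from Example 2 would be entered as =NORM.DIST(5, 4, 0.5, TRUE) - NORM.DIST(3, 4, 0.5, TRUE).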

Errors in data: sampling vs non-sampling errors

  • Two broad categories of error:

    • Sampling errors: arise from the randomness of taking samples from a population (e.g., small samples may yield an outlier that shifts the mean). These diminish as sample size grows.

    • Non-sampling (systematic) errors: persist irrespective of sample size and can bias results (e.g., measurement or survey bias).

  • Sampling errors:

    • Caused by chance alone: even properly drawn random samples can be unrepresentative.

    • The intuitive rule: larger samples reduce sampling error, but there is no free lunch; increasing the sample size can be costly or impractical.

  • Non-sampling errors (systematic):

    • Selection bias: e.g., asking volunteers for salary disclosure; those at certain salary levels may be less likely to disclose it.

    • Nonresponse bias: certain groups are less likely to respond to surveys, causing the respondents to be a non-representative subpopulation.

    • These biases persist regardless of sample size and require design or data collection changes to mitigate.

  • Real-world relevance: both types of errors affect the reliability of inferences from data; understanding and addressing them is critical in practice and in exam-style questions.

Population parameters, samples, and sampling distributions

  • Population parameters (theoretical): mean μ and standard deviation σ describe the entire population.

  • Sample statistics (observed): sample mean \bar{X} and sample standard deviation s derived from a sample of size n.

  • Sampling distribution of the mean:

    • Definition: the distribution of all possible sample means from samples of size n drawn from the population.

    • Key properties:

      • Its mean equals the population mean: E(\bar{X}) = \mu.

      • Its standard deviation is the standard error: SE(\bar{X}) = \frac{\sigma}{\sqrt{n}}. (If σ is unknown and you use s, replace σ with s in the SE formula.)

      • Shape: if the original population is normal, the sampling distribution of the mean is exactly normal for any n; if the population is not normal, the distribution tends toward normal as n increases (Central Limit Theorem).

  • Central Limit Theorem (CLT):

    • If the population is normal, the sampling distribution of the mean is normal regardless of n.

    • If the population is not normal, the sampling distribution of the mean tends to normal as n grows; a common rule of thumb for “large enough” is n ≥ 30 (the so-called magic number).

    • For proportions (a discrete rather than continuous variable), the corresponding rule of thumb is np > 5 and n(1-p) > 5 for the normal approximation to be reasonable.

  • Important takeaway: the sampling distribution provides the basis for inference about the population mean and for constructing confidence intervals.
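
A simulation sketch (not from the lecture; Python with numpy assumed) that checks all three properties at once: the mean of the sample means lands on μ, their spread matches σ/√n, and the shape looks normal even though the population is skewed:

```python
import numpy as np

rng = np.random.default_rng(0)

# A deliberately skewed (non-normal) population: exponential with mean 2.
# For the exponential distribution, sigma equals the mean, so sigma = 2.
mu, sigma = 2.0, 2.0
n = 36              # sample size
reps = 100_000      # number of repeated samples

# Draw many samples of size n and record each sample mean
sample_means = rng.exponential(scale=mu, size=(reps, n)).mean(axis=1)

print(f"mean of sample means: {sample_means.mean():.3f} (theory: mu = {mu})")
print(f"SD of sample means:   {sample_means.std():.3f} "
      f"(theory: sigma/sqrt(n) = {sigma / n**0.5:.3f})")
# A histogram of sample_means would already look close to a normal curve (CLT).
```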

Confidence intervals for the mean

  • Definition: an interval estimate paired with a stated confidence level; if the study were repeated many times, that proportion of the resulting intervals would contain the true population mean.

  • Two parts of a confidence interval:

    • The interval itself (lower bound L, upper bound U).

    • The associated confidence level (e.g., 95%, 99%).

  • General form (for large samples or known σ):
    \bar{X} \pm z_{\alpha/2} \cdot SE(\bar{X}),
    where SE(\bar{X}) = \frac{\sigma}{\sqrt{n}} (or \frac{s}{\sqrt{n}} if using sample SD).

  • Interpretation caveat: A 95% CI means that if we repeated the sampling process many times and computed a 95% CI each time, about 95% of those intervals would contain the true population mean. It does not guarantee that a specific interval contains the true mean.

  • Example from the lecture:

    • Data: 36 respondents; sample mean \bar{X} = 28.60; sample standard deviation s = 2.40.

    • Standard error: SE = \frac{s}{\sqrt{n}} = \frac{2.40}{\sqrt{36}} = \frac{2.40}{6} = 0.40.

    • For a 95% CI, using z_{0.025} \approx 1.96, the half-width is 1.96 \times 0.40 = 0.784.

    • Confidence interval:
      [27.82, 29.38]

    • Interpretation: We are 95% confident that the true mean willingness to pay in the population lies between $27.82 and $29.38.

  • Practical implications:

    • A wider interval (higher confidence) trades off precision for confidence; to maintain a given precision with higher confidence, a larger sample is needed.

    • The CI here pertains to the population mean (average willingness to pay), not to any individual’s willingness to pay.

  • Note on calculation by hand: the lecture emphasizes understanding the concepts; in this course you typically do not compute these by hand, since the formulas are provided and you can use the slides or software (see the sketch below).
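
The lecture’s interval reproduced as a short Python sketch (numbers from the example above; scipy assumed as the stand-in for Excel):

```python
import math
from scipy.stats import norm

# Numbers from the lecture example
n = 36
x_bar = 28.60   # sample mean
s = 2.40        # sample standard deviation

se = s / math.sqrt(n)        # 2.40 / 6 = 0.40
z = norm.ppf(1 - 0.05 / 2)   # z_{0.025} ≈ 1.96
half_width = z * se          # ≈ 0.784

lower, upper = x_bar - half_width, x_bar + half_width
print(f"95% CI for the mean: [{lower:.2f}, {upper:.2f}]")   # [27.82, 29.38]
```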

Putting it all together: why this matters in practice

  • The central idea is to move from single measurements to probabilistic inferences about the population through:

    • Transforming raw scores into standardized measures (z-scores) to interpret relative standing and to compute probabilities.

    • Understanding the sampling distribution of the mean to quantify how much a sample mean could vary from the true population mean (standard error).

    • Using the Central Limit Theorem to justify normal-approximation-based inference even when the population is not exactly normal.

    • Constructing confidence intervals to express uncertainty around the population mean and to inform decisions (e.g., pricing, scheduling, budgeting).

  • Ethical and practical implications discussed:

    • Selection bias and nonresponse can distort inferences and should be mitigated by study design and data collection strategies.

    • Acknowledge limitations of the data and avoid over-interpreting interval estimates for individuals; CI applies to the population mean, not to individuals.

Quick references and formulas to memorize

  • Z-score: z = \frac{X - \mu}{\sigma}

  • Probability from a z-score (standard normal): use the standard normal table or software to obtain P(Z ≤ z).

  • Probability between two scores: if X ~ N(μ, σ²), then for an interval [a, b],
    P(a \le X \le b) = \text{NORM.DIST}(b, \mu, \sigma, TRUE) - \text{NORM.DIST}(a, \mu, \sigma, TRUE).

  • Standard error of the mean: SE(\bar{X}) = \frac{\sigma}{\sqrt{n}} (or \frac{s}{\sqrt{n}} when σ is unknown and s is the sample SD).

  • Confidence interval for the mean: \bar{X} \pm z_{\alpha/2} \cdot SE(\bar{X}).

  • CLT takeaway (shape):

    • If the population is normal, the sampling distribution of the mean is normal for any n.

    • If the population is not normal, the sampling distribution of the mean tends to normal as n increases (commonly n ≈ 30 as a rule of thumb).

  • Proportion rule of thumb for large samples: require np > 5 and n(1-p) > 5 for normal approximation to be reasonable.

Next steps mentioned in class

  • The session leads into our first inference test (tied to confidence intervals and hypothesis testing).

  • Expect further discussion on choosing the right sample size and applying these concepts to real data.

  • Reminder: Excel can handle several of these computations with built-in functions (e.g., NORM.DIST, NORM.INV) to get probabilities and quantiles; practice with your data will help solidify these methods.