Normal Distribution and Estimation

Probability Distributions

A probability distribution assigns probabilities to the possible values of a random variable (examples: normal, uniform, binomial)

Probability distributions can be determined analytically

For complex distributions, simulation is often easier
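A minimal sketch in R of the analytic-versus-simulation idea. The scenario (counting heads in 10 fair coin flips), the seed, and the number of simulations are assumptions chosen only for illustration. Here the analytic answer is easy to get, so it serves as a check on the simulation.

```r
# Estimate P(exactly 3 heads in 10 fair coin flips) two ways.
set.seed(1)          # arbitrary seed, only so the run is reproducible
n_flips <- 10        # hypothetical number of flips per experiment
p_heads <- 0.5       # fair coin (assumption)
n_sims  <- 100000    # hypothetical number of simulated experiments

# Simulation: repeat the experiment many times and take the observed proportion
heads <- rbinom(n_sims, size = n_flips, prob = p_heads)
p_sim <- mean(heads == 3)

# Analytic: the binomial probability function
p_exact <- dbinom(3, size = n_flips, prob = p_heads)

print(c(simulated = p_sim, exact = p_exact))
```

The two numbers should agree to about two decimal places. For distributions with no convenient formula, only the simulation route is available.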

Empirical Distribution

“Empirical” means based on observation

The empirical distribution takes a sample of n measurements and assigns probability 1/n to each of the observed values. Drawing from it is “resampling with replacement”

For “large samples,” the empirical distribution based on a random sample will tend to reflect the properties of the underlying probability distribution
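A rough R sketch of these two points, assuming a made-up sample of 1000 draws from a population with mean 0 and standard deviation 1 (the data, seed, and sample size are illustrative, not from the notes).

```r
set.seed(2)                 # arbitrary seed for reproducibility
x <- rnorm(1000)            # hypothetical observed sample, n = 1000

# Drawing from the empirical distribution: every observed value has
# probability 1/n on each draw, which is resampling with replacement.
resample <- sample(x, size = length(x), replace = TRUE)

# With a large sample, empirical properties sit close to the population's
# (mean 0, SD 1, and half the values at or below 0).
print(c(sample_mean  = mean(x),
        sample_sd    = sd(x),
        prop_below_0 = mean(x <= 0)))
```

Every value in `resample` is one of the original observations, and some appear more than once while others are left out.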

Random Variables

A random variable is a mapping from an outcome to a number

  • Discrete random variables

    • Can assume only a countable number of possible values

    • Probabilities are defined by a probability function 

  • Continuous random variables

    • May assume any number in an interval

    • Probability = area under curve

    • P(X = a) = 0 for any single value a, so < is the same as <= and > is the same as >=
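To make "probability = area under the curve" concrete, here is a small R sketch. It uses the standard normal density `dnorm` as a stand-in for any continuous distribution, and the interval endpoints are arbitrary choices.

```r
lower <- -1   # arbitrary left endpoint
upper <-  1   # arbitrary right endpoint

# Probability of landing in [lower, upper] = area under the density there
area <- integrate(dnorm, lower = lower, upper = upper)$value
print(area)

# Shrinking the interval toward a single point shrinks the area toward 0,
# which is why including or excluding an endpoint makes no difference.
widths <- c(0.1, 0.01, 0.001)
areas  <- sapply(widths, function(w) {
  integrate(dnorm, lower = lower, upper = lower + w)$value
})
print(areas)
```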

Percentiles

An observation’s percentile is the percentage of the population that is equal to or less than the observation

The percentile rank of x = probability the random variable will be <= x

Also called the CDF (Cumulative distribution function)

For continuous RV (random variable), percentile = area under the curve to the left

  • Probability X is to the left of something = CDF

    • P(X<a) = P(X<=a) = CDF(a)

The inverse CDF

  • If CDF(a)=p, then InvCDF(p)=a

  • What is the 10th percentile of X?

    • a=10th percentile

    • P(X<=a) = .10

    • CDF(a) = .10

    • a = InvCDF(.10)

Ex) find a such that P(X>a) = .05

  • P(X<=a) = .95

  • CDF(a) = .95

  • a = InvCDF(.95)

Normal curve in R

Standard normal = normal with mean = 0 and standard deviation = 1

pnorm(z) returns the area under the curve to the left of z. This is the CDF

qnorm(area) returns the value of z that has the given area to the left of it. This is InvCDF
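A short sketch of how the CDF and InvCDF questions above might be answered in R, assuming X is standard normal (that assumption is only for illustration; the variable names are made up).

```r
# CDF: area to the left of a value under the standard normal curve
half <- pnorm(0)        # 0.5, half the area lies below the mean

# InvCDF: the 10th percentile, i.e. a such that P(Z <= a) = 0.10
a10 <- qnorm(0.10)      # about -1.28

# Find a such that P(Z > a) = 0.05, i.e. P(Z <= a) = 0.95
a95 <- qnorm(0.95)      # about 1.645

# pnorm and qnorm undo each other
print(c(a10 = a10, check = pnorm(a10)))       # check is 0.10
print(c(a95 = a95, check = 1 - pnorm(a95)))   # check is 0.05
```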

Z-score

Standard units = z-score

z = standard score = (original score - mean) / SD

By default, pnorm and qnorm assume the standard normal (z). You can specify a different mean and standard deviation with the mean and sd arguments.
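The sketch below illustrates that passing `mean` and `sd` to `pnorm` or `qnorm` gives the same answer as converting to z-scores by hand. The numbers (mean 100, SD 15, value 115) are hypothetical.

```r
mu    <- 100   # hypothetical mean
sigma <- 15    # hypothetical standard deviation
x     <- 115   # hypothetical observation

# CDF: standardize first, or hand mean/sd to pnorm directly
p_via_z  <- pnorm((x - mu) / sigma)
p_direct <- pnorm(x, mean = mu, sd = sigma)

# InvCDF: the 90th percentile on the original scale, both ways
q_via_z  <- mu + sigma * qnorm(0.90)
q_direct <- qnorm(0.90, mean = mu, sd = sigma)

print(c(p_via_z = p_via_z, p_direct = p_direct))
print(c(q_via_z = q_via_z, q_direct = q_direct))
```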

Proportions

• The proportion of the population with a measurement in a certain range is equal to the area under the frequency curve over that range.

• Proportion of the population and probability of selection are equivalent.

At UNLV, students average 27 years of age with a standard deviation of 3 years. Ages of UNLV students are approximately normal

  • mean = 27

  • Standard deviation = 3

What proportion of UNLV students are between 24 and 30?

Convert to Z → (x-mean)/SD

  • Lower endpoint: (24-27)/3 = -1

  • Upper endpoint: (30-27)/3 = 1

P(24<=x<=30) = P(-1<=Z<=1) = pnorm(1) – pnorm(-1) = 0.8413447 - 0.1586553 = 0.6826895
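The calculation above can be reproduced in R roughly as follows. The variable names are made up; the mean and SD are the UNLV values from the example.

```r
mu    <- 27   # mean age from the example
sigma <- 3    # standard deviation from the example

# Route 1: convert both endpoints to z-scores, then use the standard normal
z_low  <- (24 - mu) / sigma   # -1
z_high <- (30 - mu) / sigma   #  1
prop_z <- pnorm(z_high) - pnorm(z_low)

# Route 2: let pnorm do the standardizing
prop_direct <- pnorm(30, mean = mu, sd = sigma) - pnorm(24, mean = mu, sd = sigma)

print(c(via_z = prop_z, direct = prop_direct))   # both about 0.6827
```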

Working backward:

Z = (observed value - mean) / SD

What's the z-score for a 19-year-old student at UNLV?

z = (x - μ)/σ = (19 - 27)/3 = -2.67

A 19-year-old UNLV student is 2.67 standard deviations younger than the mean.

How old is a UNLV student if their age’s z-score is 2.5?

x = μ + zσ = 27 + (2.5)(3) = 34.5
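Both directions of the conversion can be wrapped in two small helper functions. The names `to_z` and `from_z` are hypothetical, not built into R.

```r
# Observation -> z-score
to_z <- function(x, mean, sd) (x - mean) / sd

# z-score -> observation (working backward)
from_z <- function(z, mean, sd) mean + z * sd

z_19   <- to_z(19, mean = 27, sd = 3)      # about -2.67
age_25 <- from_z(2.5, mean = 27, sd = 3)   # 34.5

print(c(z_19 = round(z_19, 2), age_25 = age_25))
```

Applying one helper after the other returns the starting value, for example `from_z(to_z(19, 27, 3), 27, 3)` gives back 19.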

How big are most of the values?

The bulk of the data should fall within a few standard deviations of the mean

For a normal distribution, almost all of the data (about 99.7%) are in the range “average ± 3 SDs”
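As a quick check of how much of a normal distribution sits within 1, 2, and 3 SDs of the mean, one could compute the areas with `pnorm` (a sketch; the variable names are illustrative):

```r
k <- 1:3                                # number of SDs on either side of the mean
coverage <- pnorm(k) - pnorm(-k)        # area between -k and +k on the z scale
names(coverage) <- paste0("within_", k, "_SD")

print(round(coverage, 4))               # 0.6827 0.9545 0.9973
```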

Parameters, Statistics, and Estimation

A parameter: A numeric description of the population

  • Population mean

  • Population Standard Deviation

  • Population proportion

Statistic: A numeric description of the sample

  • x bar = sample mean

  • s = sample standard deviation

  • p hat = sample proportion

A sample statistic can be used as an estimate of the corresponding population parameter

Values of a statistic vary because random samples vary

Sampling Distribution = probability distribution of a statistic based on every possible sample of a particular size

Bootstrapping: Using the empirical distribution to generate “new” samples in order to estimate the properties of a sampling distribution
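A minimal bootstrap sketch in R, assuming a made-up sample of 50 skewed measurements and 2000 resamples (both numbers, the seed, and the data are illustrative choices, not from the notes).

```r
set.seed(3)                    # arbitrary seed for reproducibility
x <- rexp(50, rate = 1)        # hypothetical observed sample, n = 50
B <- 2000                      # hypothetical number of bootstrap resamples

# Each "new" sample is drawn from the empirical distribution of x,
# i.e. resampled with replacement, and its mean is recorded.
boot_means <- replicate(B, mean(sample(x, size = length(x), replace = TRUE)))

# The spread of the resampled means estimates the spread of the
# sampling distribution of the sample mean.
boot_se <- sd(boot_means)

print(c(sample_mean    = mean(x),
        bootstrap_mean = mean(boot_means),
        bootstrap_se   = boot_se))
```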

If the original population has mean = μ and standard deviation = σ, then the sampling distribution of x bar will have mean = μ and standard deviation = σ/sqrt(n)
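This result can be checked by simulation. The sketch below assumes a normal population with mean 27 and SD 3, samples of size 25, and 10000 repetitions; all of these are illustrative choices.

```r
set.seed(4)        # arbitrary seed for reproducibility
mu     <- 27       # hypothetical population mean
sigma  <- 3        # hypothetical population standard deviation
n      <- 25       # hypothetical sample size
n_reps <- 10000    # hypothetical number of repeated samples

# Draw many samples of size n and keep each sample mean
xbars <- replicate(n_reps, mean(rnorm(n, mean = mu, sd = sigma)))

print(c(mean_of_xbar = mean(xbars),        # should be close to mu
        sd_of_xbar   = sd(xbars),          # should be close to sigma / sqrt(n)
        theory_sd    = sigma / sqrt(n)))   # 3 / 5 = 0.6
```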

Central Limit Theorem:

● If we consider repeated samples of size n from any distribution (with a finite standard deviation) and compute 𝑥 bar = sample mean = sample average, the sampling distribution of 𝑥 bar will be approximately normal for large samples.

● For large samples, we can treat 𝑥 bar as a normal random variable.
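A simulation sketch of the Central Limit Theorem, assuming a strongly skewed population (exponential with mean 1 and SD 1), samples of size 50, and 10000 repetitions. These settings are illustrative only.

```r
set.seed(5)        # arbitrary seed for reproducibility
n      <- 50       # hypothetical sample size (counts as "large" here)
n_reps <- 10000    # hypothetical number of repeated samples

# Sample means from a skewed, clearly non-normal population
xbars <- replicate(n_reps, mean(rexp(n, rate = 1)))

# If the sample mean is approximately normal with mean 1 and SD 1/sqrt(n),
# these proportions should be near the normal values.
se <- 1 / sqrt(n)
within_1 <- mean(abs(xbars - 1) <= 1 * se)
within_2 <- mean(abs(xbars - 1) <= 2 * se)

print(c(within_1 = within_1, normal_1 = pnorm(1) - pnorm(-1)))
print(c(within_2 = within_2, normal_2 = pnorm(2) - pnorm(-2)))
```

A histogram of the sample means, `hist(xbars)`, should look roughly bell-shaped even though the population itself is not.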

Empirical Distribution

“Empirical” = based on observation

For a sample of n measurements, the empirical distribution assigns probability 1/n to each of the observed values.

Is the original population normal?

  • If “yes,” then the sampling distribution of 𝑥 bar will be normal.

  • If “no,” then ask “Is the sample size large (at least 30)?”

    • If “yes,” the sampling distribution of 𝑥 bar will be approximately normal (by CLT).

    • If “no,” the form of the sampling distribution of 𝑥 bar is unknown and should be analyzed without assuming normality.
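The decision steps above could be written as a small helper function. The function name and the returned labels are hypothetical; the cutoff of 30 is the rule of thumb stated in the notes.

```r
# Describe the sampling distribution of the sample mean, given whether the
# population is normal and how large the sample is.
sampling_dist_of_xbar <- function(population_normal, n) {
  if (population_normal) {
    "normal"
  } else if (n >= 30) {
    "approximately normal (CLT)"
  } else {
    "unknown; do not assume normality"
  }
}

print(sampling_dist_of_xbar(TRUE, 10))
print(sampling_dist_of_xbar(FALSE, 45))
print(sampling_dist_of_xbar(FALSE, 12))
```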