Normal Distribution and Estimation
Probability Distributions
A probability distribution assigns probabilities to the possible values of a random variable (examples: normal, uniform, binomial)
Probability distributions can be determined analytically
For complex distributions, simulation is often easier
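A minimal sketch of the analytic-versus-simulation idea, using a made-up example that is not in the notes: the chance that two fair dice sum to at least 10. The exact answer is 6/36; the simulation should land close to it. The seed and the number of simulations are arbitrary choices.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
n_sims <- 100000                 # number of simulated rolls (assumed value)

# Roll two fair dice n_sims times
die1 <- sample(1:6, n_sims, replace = TRUE)
die2 <- sample(1:6, n_sims, replace = TRUE)

p_sim   <- mean(die1 + die2 >= 10)  # simulated probability
p_exact <- 6 / 36                   # analytic probability (6 of 36 outcomes)

c(simulated = p_sim, exact = p_exact)
```

For a problem this small the analytic route is easy; the same simulation pattern still works when the exact calculation gets hard.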
Empirical Distribution
“Empirical” means based on observation
Takes a sample of n measurements, then assigns probability 1/n to each of the observed values. Drawing from the empirical distribution is the same as “resampling with replacement” from the sample
For “large samples,” the empirical distribution based on a random sample will tend to reflect the properties of the underlying probability distribution
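A small sketch of an empirical distribution in R, assuming a made-up sample of five measurements. Each observation gets probability 1/n, so a value that appears twice ends up with probability 2/n. Drawing from this distribution is done with sample(..., replace = TRUE).

```r
x <- c(2, 4, 4, 7, 9)            # made-up sample (assumed data)
n <- length(x)

# Empirical probabilities: each observation counts 1/n
emp_probs <- table(x) / n
emp_probs

# One draw of size n from the empirical distribution
# = resampling the original sample with replacement
set.seed(42)                     # arbitrary seed
resample <- sample(x, size = n, replace = TRUE)
resample
```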
Random Variables
A random variable is a mapping from an outcome to a number
Discrete random variables
Can assume only a countable number of possible values
Probabilities are defined by a probability function
Continuous random variables
May assume any number in an interval
Probability = area under the curve
Because P(X = a) = 0 for any single value a, < is the same as <= and > is the same as >=
Percentiles
An observation’s percentile is the percentage of the population that is equal to or less than the observation
The percentile rank of x = probability the random variable will be <= x
Also called the CDF (Cumulative distribution function)
For continuous RV (random variable), percentile = area under the curve to the left
The probability that X falls to the left of a value a is the CDF at a
P(X<a) = P(X<=a) = CDF(a)
The inverse CDF
If CDF(a)=p, then InvCDF(p)=a
What is the 10th percentile of X?
a=10th percentile
P(X<=a) = .10
CDF(a) = .10
a = InvCDF(.10)
Example: find a such that P(X>a) = .05
P(X<=a) = 1 - .05 = .95
CDF(a) = .95
a = InvCDF(.95)
Normal curve in R
Standard normal = normal with mean = 0 and standard deviation = 1
pnorm(z) returns the area under the curve to the left of z. This is the CDF
qnorm(area) returns the value of z that has the given area to the left of it. This is the InvCDF
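A short sketch of pnorm and qnorm on the standard normal, covering the CDF/InvCDF relationship and the two percentile questions worked earlier. The value 1.5 is an arbitrary choice for the round trip.

```r
# CDF and inverse CDF undo each other
p <- pnorm(1.5)        # area to the left of z = 1.5
z <- qnorm(p)          # recovers 1.5

# 10th percentile of Z: solve CDF(a) = .10
z10 <- qnorm(0.10)     # about -1.28

# Find a such that P(Z > a) = .05, i.e. CDF(a) = .95
a <- qnorm(0.95)       # about 1.645

# Check: the upper-tail area beyond a is .05
upper_tail <- pnorm(a, lower.tail = FALSE)

c(p = p, z = z, z10 = z10, a = a, upper_tail = upper_tail)
```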
Z-score
Standard units = z-score
z = standard score = (original score - mean) / SD
By default, pnorm and qnorm assume the standard normal (z). You can specify a different mean and standard deviation with the mean and sd arguments.
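A sketch of the mean and sd arguments, assuming a made-up scale with mean 100 and SD 15 (values chosen only for illustration). Passing mean and sd gives the same area as converting to a z-score first, and qnorm returns percentiles in the original units.

```r
# Assumed scale for illustration: mean 100, SD 15
p_direct <- pnorm(115, mean = 100, sd = 15)   # area to the left of 115
p_via_z  <- pnorm((115 - 100) / 15)           # same area via the z-score

# 90th percentile in original units (about 119.2)
x90 <- qnorm(0.90, mean = 100, sd = 15)

c(p_direct = p_direct, p_via_z = p_via_z, x90 = x90)
```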
Proportions
• The proportion of the population with a measurement in a certain range is equal to the area under the frequency curve over that range.
• Proportion of the population and probability of selection are equivalent.
At UNLV, students average 27 years of age with a standard deviation of 3 years. Ages of UNLV students are approximately normally distributed.
mean = 27
Standard deviation = 3
What proportion of UNLV students are between 24 and 30?
Convert to Z → (x-mean)/SD = (24-27)/3 = -1
(30-27)/3 = 1
P(24<=X<=30) = P(-1<=Z<=1) = pnorm(1) - pnorm(-1) = 0.8413447 - 0.1586553 ≈ 0.6827
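The UNLV calculation can be sketched in R two ways: by standardizing first, or by passing the mean and SD straight to pnorm. Both give about 0.6827. The variable names are illustrative.

```r
mu    <- 27   # mean age from the example
sigma <- 3    # standard deviation from the example

# Route 1: work in original units
p_direct <- pnorm(30, mean = mu, sd = sigma) - pnorm(24, mean = mu, sd = sigma)

# Route 2: convert the endpoints to z-scores first
z_low  <- (24 - mu) / sigma   # -1
z_high <- (30 - mu) / sigma   #  1
p_z <- pnorm(z_high) - pnorm(z_low)

c(p_direct = p_direct, p_z = p_z)
```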
Working backward:
Z = (observed value - mean) / SD
What's the z-score for a 19-year-old student at UNLV?
z = (x - μ)/σ = (19 - 27)/3 ≈ -2.67
A 19-year-old UNLV student is about 2.67 standard deviations younger than the mean.
How old is a UNLV student if their age’s z-score is 2.5?
x = mean + z(SD) = 27 + 2.5(3) = 34.5
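Both directions of the z-score conversion, sketched as two small helper functions. The names to_z and from_z are hypothetical, not built-in R functions.

```r
mu    <- 27
sigma <- 3

# Hypothetical helpers (not part of base R)
to_z   <- function(x, mean, sd) (x - mean) / sd   # original units -> z
from_z <- function(z, mean, sd) mean + z * sd     # z -> original units

z_19 <- to_z(19, mu, sigma)      # about -2.67
age  <- from_z(2.5, mu, sigma)   # 34.5

c(z_19 = z_19, age = age)
```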
How big are most of the values?
The bulk of the data should fall within a few standard deviations of the mean
For a normal distribution, almost all of the data (about 99.7%) are in the range “average ± 3 SDs”
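A quick check of how much of a normal distribution lies within 1, 2, and 3 SDs of the mean, using pnorm on the standard normal.

```r
k <- 1:3
# Area between -k and +k standard deviations
within_k <- pnorm(k) - pnorm(-k)
round(within_k, 4)   # 0.6827 0.9545 0.9973
```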
Parameters, Statistics, and Estimation
Parameter: A numeric description of the population
Population mean
Population Standard Deviation
Population Proportion
Statistic: A numeric description of the sample
x bar = sample mean
s = sample standard deviation
p hat = sample proportion
A sample statistic can be used as an estimate of the corresponding population parameter
Values of a statistic vary because random samples vary
Sampling Distribution = probability distribution of a statistic based on every possible sample of a particular size
Bootstrapping: Using the empirical distribution to generate “new” samples in order to estimate the properties of a sampling distribution
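A minimal bootstrap sketch, assuming a made-up sample of ten ages. Each "new" sample is drawn with replacement from the original sample, and the spread of the resulting sample means estimates the standard deviation of the sampling distribution of x bar. The seed and the number of resamples are arbitrary.

```r
set.seed(123)                                     # arbitrary seed
x <- c(23, 25, 26, 27, 27, 28, 29, 30, 31, 34)    # made-up sample (assumed data)
n_boot <- 5000                                    # number of bootstrap samples (assumed)

# Resample with replacement and record the mean each time
boot_means <- replicate(
  n_boot,
  mean(sample(x, size = length(x), replace = TRUE))
)

boot_se <- sd(boot_means)   # bootstrap estimate of the SD of x bar
c(sample_mean = mean(x), boot_mean = mean(boot_means), boot_se = boot_se)
```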
If the original population has mean = μ and standard deviation = σ, then the sampling distribution of x bar will have mean = μ and standard deviation = σ/sqrt(n)
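A simulation sketch of the σ/sqrt(n) result, assuming a normal population with the UNLV values (mean 27, SD 3) and a sample size of 25. The sample size, seed, and number of repetitions are arbitrary choices. The SD of the simulated sample means should come out close to 3/sqrt(25) = 0.6.

```r
set.seed(7)          # arbitrary seed
mu     <- 27
sigma  <- 3
n      <- 25         # sample size (assumed)
n_sims <- 10000      # number of repeated samples (assumed)

# Draw many samples of size n and keep each sample mean
xbars <- replicate(n_sims, mean(rnorm(n, mean = mu, sd = sigma)))

c(mean_of_xbars = mean(xbars),
  sd_of_xbars   = sd(xbars),
  theory_sd     = sigma / sqrt(n))
```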
Central Limit Theorem:
● If we consider repeated samples of size n from any distribution (with finite variance) and compute x bar = sample mean = sample average, the sampling distribution of x bar will be approximately normal for large samples.
● For large samples, we can treat x bar as a normal random variable.
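A sketch of the Central Limit Theorem at work, assuming a clearly non-normal population: the exponential distribution with rate 1, which has mean 1 and SD 1. If x bar is roughly normal, about 95% of the sample means should fall within 1.96 standard deviations (σ/sqrt(n)) of the population mean. The sample size of 50 and the seed are arbitrary.

```r
set.seed(2024)       # arbitrary seed
n      <- 50         # sample size (assumed "large")
n_sims <- 10000      # number of repeated samples (assumed)

mu    <- 1           # mean of the exponential(rate = 1) population
sigma <- 1           # SD of the exponential(rate = 1) population

# Sample means from a skewed population
xbars <- replicate(n_sims, mean(rexp(n, rate = 1)))

se <- sigma / sqrt(n)
# Share of sample means within 1.96 SDs of mu; close to 0.95 if normal-like
prop_within <- mean(abs(xbars - mu) < 1.96 * se)

c(mean_of_xbars = mean(xbars), prop_within = prop_within)
```

A histogram of the results, hist(xbars), gives a visual check of the bell shape.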
Empirical Distribution
“Empirical” = based on observation
For a sample of n measurements, the empirical distribution assigns probability 1/n to each of the observed values.
Is the original population normal?
If “yes,” then the sampling distribution of 𝑥 bar will be normal.
If “no,” then ask “Is the sample size large (at least 30)?”
If “yes,” the sampling distribution of 𝑥 bar will be approximately normal (by CLT).
If “no,” the form of the sampling distribution of 𝑥 bar is unknown and should be analyzed without assuming normality.
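The decision steps above can be written as a small function. This is only a sketch: the function name and its arguments are hypothetical, and the cutoff of 30 is the rule of thumb from these notes.

```r
# Hypothetical helper (not a built-in R function) encoding the decision rule
sampling_dist_of_mean <- function(population_normal, n, large_n = 30) {
  if (population_normal) {
    "normal"
  } else if (n >= large_n) {
    "approximately normal (CLT)"
  } else {
    "unknown; do not assume normality"
  }
}

sampling_dist_of_mean(TRUE, 5)     # "normal"
sampling_dist_of_mean(FALSE, 40)   # "approximately normal (CLT)"
sampling_dist_of_mean(FALSE, 12)   # "unknown; do not assume normality"
```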