Sampling and Empirical Distribution
Population
Set of all elements of interest
Sample
The part of the population upon which data is observed
Random Sample
A sample for which it is possible to calculate, before the sample is drawn, the chance with which any subset of elements will enter the sample
Not all possible samples need to be equally likely
Simple Random Sample = a random sample in which all possible samples of a particular size are equally likely
Random samples are important because they tend to be representative samples
Similar to population in all important characteristics
Can be used to draw inference about the population
Good sampling procedures are used to make a representative sample likely; they cannot strictly guarantee one
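A minimal sketch of the simple random sample definition above, assuming a hypothetical population of four labeled units and samples of size 2. It draws many samples without replacement using the standard library and checks that each of the 6 possible subsets turns up about equally often. The names `population`, `sample_size`, and `draws` are illustrative only.

```python
import random
from collections import Counter
from itertools import combinations

population = ["A", "B", "C", "D"]   # hypothetical population of 4 units
sample_size = 2
draws = 60_000

rng = random.Random(0)              # seeded so the run is reproducible

# Draw many simple random samples (without replacement) and count each subset.
counts = Counter(
    frozenset(rng.sample(population, sample_size)) for _ in range(draws)
)

# List every possible subset of the chosen size: C(4, 2) = 6 of them.
all_subsets = [frozenset(c) for c in combinations(population, sample_size)]

# Proportion of draws in which each subset was the sample; each should be near 1/6.
proportions = {tuple(sorted(s)): counts[s] / draws for s in all_subsets}
for subset, p in sorted(proportions.items()):
    print(subset, round(p, 3))
```

A random sample that is not simple would still have known chances for each subset, but those chances would not all be equal.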
Convenience samples
Sample consists of whomever “walks by”
Similar to a volunteer sample
Neither convenience samples nor volunteer samples are useful for inference
Sampling may be used to select the units from the target population on which data will be collected
Sampling may also be used to divide a large sample into test and training sets
Training data - used to develop predictive models
Test data - used to evaluate how well the model predicts on data it has not seen
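One possible way to make the training/test division is a random shuffle followed by a cut. The sketch below assumes a hypothetical dataset of 100 records and an 80/20 split; the function name `train_test_split` and the parameter `test_fraction` are made up for this illustration and do not refer to any particular library.

```python
import random

def train_test_split(records, test_fraction=0.2, seed=0):
    """Randomly partition records into (train, test) lists."""
    rng = random.Random(seed)      # seeded so the split is reproducible
    shuffled = list(records)       # copy so the caller's data is untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled records become test data, the rest training data.
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(100))            # hypothetical sample of 100 records
train, test = train_test_split(data, test_fraction=0.2)
print(len(train), len(test))
```

Because the split is random, neither set is systematically different from the other, which is what makes the test set a fair check on the model.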
Probability distributions can be determined analytically
For complex distributions, simulation is often easier
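As a small illustration of both routes, the sketch below takes a case simple enough to solve analytically (the sum of two fair dice, chosen here only as an example) and compares the exact probabilities with proportions from a simulation. For a genuinely complex distribution only the simulation half would be practical.

```python
import random
from collections import Counter

rng = random.Random(1)             # seeded so the run is reproducible
trials = 100_000

# Simulate the sum of two fair six-sided dice many times.
sums = Counter(rng.randint(1, 6) + rng.randint(1, 6) for _ in range(trials))
simulated = {total: sums[total] / trials for total in range(2, 13)}

# Analytic distribution: (6 - |total - 7|) of the 36 equally likely outcomes.
analytic = {total: (6 - abs(total - 7)) / 36 for total in range(2, 13)}

for total in range(2, 13):
    print(total, round(simulated[total], 3), round(analytic[total], 3))
```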
Empirical Distribution
Empirical = based on observation
Observations can be from repetitions of an experiment
All observed unique values
The proportion of times each value appears
Ex) Ages of 5 students are 18, 18, 19, 21, 22.
P(18) = 2/5, P(19) = 1/5, ….
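The example above can be computed directly. This sketch counts each unique value in the five ages and divides by the number of observations; `Fraction` is used only so the proportions print as exact fractions.

```python
from collections import Counter
from fractions import Fraction

ages = [18, 18, 19, 21, 22]

# Count how often each unique value appears, then convert counts to proportions.
counts = Counter(ages)
empirical = {
    value: Fraction(count, len(ages)) for value, count in sorted(counts.items())
}

for value, proportion in empirical.items():
    print(f"P({value}) = {proportion}")
```

The proportions always sum to 1, since every observation is counted exactly once.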
Large Random Sampling
If you repeat an experiment many times, independently and under the same conditions, the proportion of times that an event occurs tends to get closer to the theoretical probability of that event.
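A quick way to see this is to simulate it. The sketch below assumes a fair coin (theoretical probability 0.5 of heads) and records the running proportion of heads at a few checkpoints; the checkpoint values are arbitrary choices for illustration.

```python
import random

rng = random.Random(2)             # seeded so the run is reproducible
p_true = 0.5                       # theoretical probability of heads
checkpoints = (10, 100, 1_000, 10_000, 100_000)

heads = 0
running = {}                       # number of flips -> proportion of heads so far
for n in range(1, checkpoints[-1] + 1):
    heads += rng.random() < p_true   # True counts as 1, False as 0
    if n in checkpoints:
        running[n] = heads / n
        print(n, running[n])
```

The proportion need not move closer at every single step, but over many repetitions it settles near 0.5.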
To make inferences, you need a measure of reliability. One way to get it is bootstrapping (generating a “new” sample by resampling with replacement from the original sample).
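A minimal bootstrap sketch, assuming the statistic of interest is the sample mean and that a percentile interval is an acceptable measure of reliability. The observed sample here is synthetic (200 draws from a normal distribution) and stands in for real data; `bootstrap_means` is a hypothetical helper name.

```python
import random
import statistics

rng = random.Random(3)             # seeded so the run is reproducible

# Synthetic stand-in for an observed sample (an assumption of this sketch).
sample = [rng.gauss(50, 10) for _ in range(200)]

def bootstrap_means(data, rng, n_resamples=2_000):
    """Means of resamples drawn with replacement, each the size of the data."""
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)]

means = sorted(bootstrap_means(sample, rng))

# Middle 95% of the bootstrap means as an approximate interval for the mean.
lower = means[len(means) * 25 // 1000]
upper = means[len(means) * 975 // 1000 - 1]
print(round(lower, 2), round(statistics.fmean(sample), 2), round(upper, 2))
```

A narrower interval indicates a more reliable estimate; the width shrinks as the original sample gets larger.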
Theoretical Sampling Distribution
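The distribution of the sample mean will be approximately normal for large random samples (Central Limit Theorem), even when the population itself is not normal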
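This can be checked by simulation. The sketch below assumes a strongly skewed population (exponential with mean 1 and standard deviation 1, chosen only for illustration) and draws many samples of size 100. If the sample means are roughly normal, about 95% of them should fall within 1.96 standard errors of the population mean.

```python
import random
import statistics

rng = random.Random(4)             # seeded so the run is reproducible
n = 100                            # size of each sample
repetitions = 5_000                # number of samples drawn

# Each entry is the mean of one sample of size n from an exponential population.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(repetitions)
]

# Standard error of the mean: population SD (1) divided by sqrt(n).
standard_error = 1 / n ** 0.5

# Share of sample means within 1.96 standard errors of the population mean (1).
within = sum(abs(m - 1) <= 1.96 * standard_error for m in sample_means) / repetitions

print(round(statistics.fmean(sample_means), 3))
print(round(statistics.stdev(sample_means), 3))
print(round(within, 3))
```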