Sampling and Empirical Distribution

Population

  • Set of all elements of interest

Sample

  • The part of the population upon which data is observed

Random Sample

  • A sample for which it is possible to calculate, before the sample is drawn, the chance with which any subset of elements will enter the sample

  • Not all possible samples need to be equally likely

  • Simple Random Sample = all possible samples of a particular size must be equally likely.
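A minimal sketch of drawing a simple random sample, assuming a hypothetical population of 100 numbered elements and a sample size of 10 (neither comes from the notes). Python's random.sample draws without replacement, so every subset of the requested size is equally likely.

```python
import random

# Hypothetical population: ID numbers for 100 elements of interest.
population = list(range(1, 101))

# Seeded generator so the draw is reproducible (the seed is arbitrary).
rng = random.Random(0)

# Draw 10 elements without replacement; every subset of size 10
# has the same chance of being the sample.
srs = rng.sample(population, k=10)
print(srs)
```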

Random samples are important because they tend to be representative samples

  • Similar to population in all important characteristics

  • Can be used to draw inference about the population

  • Good sampling procedures are used to make a representative sample likely

Convenience samples

  • Sample consists of whoever “walks by”

  • Similar to a volunteer sample

  • Neither convenience samples nor volunteer samples are useful for inference

Sampling may be used to select the units from the target population on which data will be collected.

Sampling may also be used to divide a large sample into test and training sets

  • Training data - used to develop predictive models

  • Test data - used to evaluate how well the predictive models perform
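One way the split into training and test sets might be sketched is shown here. The function name train_test_split, the 20% test fraction, and the 50-row data set are illustrative assumptions, not part of the notes.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Randomly divide rows into (training, test) lists.

    Hypothetical helper: the name and defaults are assumptions.
    """
    rng = random.Random(seed)
    shuffled = list(rows)   # copy so the caller's data is not reordered
    rng.shuffle(shuffled)   # random order, so the split is a random sample
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(50))      # stand-in for 50 observed units
train_set, test_set = train_test_split(data)
print(len(train_set), len(test_set))  # 40 10
```

Shuffling before cutting means each unit has the same chance of landing in the test set, which keeps the two sets comparable.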

Probability distributions can be determined analytically

For complex distributions, simulation is often easier
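As a hedged illustration of using simulation in place of an analytic calculation, the sketch below estimates the chance that three fair dice sum to at least 15. The dice example and the number of trials are assumptions chosen for illustration. Here the exact answer is still easy to count (20 of 216 outcomes), which makes it possible to check the simulation.

```python
import random

rng = random.Random(1)   # arbitrary seed for reproducibility
n_trials = 100_000       # assumed number of simulated experiments

# Simulate rolling three fair dice and count how often the sum is >= 15.
hits = 0
for _ in range(n_trials):
    total = sum(rng.randint(1, 6) for _ in range(3))
    if total >= 15:
        hits += 1

estimate = hits / n_trials
exact = 20 / 216         # counted analytically: 10 + 6 + 3 + 1 outcomes
print(f"simulated: {estimate:.4f}, exact: {exact:.4f}")
```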

Empirical Distribution

Empirical = based on observation

Observations can be from repetitions of an experiment

An empirical distribution consists of:

  • All observed unique values

  • The proportion of times each value appears

  • Ex) Ages of 5 students are 18, 18, 19, 21, 22.

    • P(18) = 2/5, P(19) = 1/5, ….
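The ages example can be computed directly. This is a small sketch using only the standard library; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction

ages = [18, 18, 19, 21, 22]

# Count how many times each unique value appears.
counts = Counter(ages)

# Empirical distribution: each unique value mapped to its proportion.
empirical = {
    value: Fraction(count, len(ages))
    for value, count in sorted(counts.items())
}

for value, p in empirical.items():
    print(f"P({value}) = {p}")
# P(18) = 2/5
# P(19) = 1/5
# P(21) = 1/5
# P(22) = 1/5
```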

Large Random Sampling

If you repeat an experiment a large number of times, the proportion of times that an event occurs tends to get closer to the theoretical probability of that event (the law of large numbers).
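A rough sketch of this behavior, assuming the experiment is rolling a fair die and the event is "roll a six" (both are illustrative choices): the observed proportion is printed for increasing numbers of repetitions and should drift toward 1/6.

```python
import random

rng = random.Random(2)   # arbitrary seed for reproducibility
theoretical = 1 / 6      # probability of a six on a fair die

proportions = {}
for n in (10, 100, 1_000, 10_000, 100_000):
    # Repeat the experiment n times and record how often the event occurs.
    sixes = sum(1 for _ in range(n) if rng.randint(1, 6) == 6)
    proportions[n] = sixes / n
    print(f"n = {n:>6}: proportion = {proportions[n]:.4f}")

print(f"theoretical probability = {theoretical:.4f}")
```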

To make inferences, you have to have a measure of reliability. This can be done by bootstrapping (generating a “new” sample by resampling, with replacement, from the observed sample)
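A minimal bootstrap sketch, assuming a made-up sample of ten ages and a percentile interval for the mean. The data, the 5,000 resamples, and the 95% level are all assumptions for illustration.

```python
import random
import statistics

rng = random.Random(3)   # arbitrary seed for reproducibility

# Hypothetical observed sample (not from the notes).
sample = [18, 18, 19, 21, 22, 20, 19, 23, 18, 21]
n_boot = 5_000           # assumed number of bootstrap resamples

# Each "new" sample is drawn from the observed sample with replacement
# and has the same size as the original.
boot_means = sorted(
    statistics.mean(rng.choices(sample, k=len(sample)))
    for _ in range(n_boot)
)

# Percentile interval: middle 95% of the bootstrap means.
lower = boot_means[n_boot * 25 // 1000]
upper = boot_means[n_boot * 975 // 1000]
print(f"sample mean = {statistics.mean(sample):.2f}")
print(f"approx. 95% interval = ({lower:.2f}, {upper:.2f})")
```

The spread of the bootstrap means gives a sense of how much the sample mean might vary from one sample to another.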

Theoretical Sampling Distribution

The distribution of the sample mean will be approximately normal for large samples (central limit theorem)
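This can be checked by simulation. The sketch below assumes a skewed population (exponential with mean 1), samples of size 50, and 5,000 repeated samples, all chosen for illustration. If the sample means are roughly normal, about 95% of them should fall within 1.96 standard errors of the population mean.

```python
import math
import random
import statistics

rng = random.Random(4)   # arbitrary seed for reproducibility
sample_size = 50         # assumed "large" sample size
n_samples = 5_000        # assumed number of repeated samples

# Population: exponential with rate 1, so mean 1 and standard deviation 1.
sample_means = [
    statistics.mean(rng.expovariate(1.0) for _ in range(sample_size))
    for _ in range(n_samples)
]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)
standard_error = 1 / math.sqrt(sample_size)   # population sd / sqrt(n)

# Share of sample means within 1.96 standard errors of the true mean.
within = sum(
    abs(m - 1.0) <= 1.96 * standard_error for m in sample_means
) / n_samples

print(f"mean of sample means = {mean_of_means:.3f}")
print(f"sd of sample means = {sd_of_means:.3f} (theory: {standard_error:.3f})")
print(f"share within 1.96 SE = {within:.3f}")
```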