Stat 5101 Lecture Slides: Deck 1 Probability and Expectation on Finite Sample Spaces

Sets

  • In mathematics, a set is a collection of objects considered as a single entity.
  • The objects within a set are called its elements.
  • x \in S denotes that x is an element of the set S.
  • A \subset S indicates that set A is a subset of set S, meaning every element of A is also an element of S.
  • Sets can be indicated by listing elements within curly brackets, e.g., {1, 2, 3, 4}.
  • Sets can contain various types of elements, not just numbers; e.g., {1, 2, \pi, \text{cabbage}, {0, 1, 2}}.
  • The empty set is the unique set containing no elements.
  • The empty set is denoted by \emptyset or {}.
  • N denotes the set of natural numbers {0, 1, 2, …}.
  • Z denotes the set of integers {…, -2, -1, 0, 1, 2, …}.
  • R denotes the set of real numbers.
  • Set builder notation: { x \in S : \text{some condition on x} } represents the set of elements in S that satisfy the specified condition.
    • { h(x) : x \in S } denotes the image or range of the function h with domain S.
    • Example: { x \in R : x > 0 } is the set of positive real numbers.

Intervals

  • Intervals are a special type of set.
  • Notation:
    • (a, b) = { x \in R : a < x < b }: Open interval with endpoints a and b. (1)
    • [a, b] = { x \in R : a \leq x \leq b }: Closed interval with endpoints a and b. (2)
    • (a, b] = { x \in R : a < x \leq b }: Half-open interval. (3)
    • [a, b) = { x \in R : a \leq x < b }: Half-open interval. (4)
    • These notations assume a and b are real numbers with a < b.
  • Infinite intervals:
    • (a, \infty) = { x \in R : a < x }: Open interval. (5)
    • [a, \infty) = { x \in R : a \leq x }: Closed interval. (6)
    • (-\infty, b) = { x \in R : x < b }: Open interval. (7)
    • (-\infty, b] = { x \in R : x \leq b }: Closed interval. (8)
    • (-\infty, \infty) = R: The set of all real numbers, both open and closed. (9)

Functions

  • A mathematical function is a rule that maps each point in one set (the domain) to a point in another set (the codomain).

  • Functions can also be called maps, mappings, or transformations.

  • Functions are often denoted by single letters, such as f.

  • f(x) represents the value of the function f at the point x.

  • If X is the domain and Y the codomain of the function f, this is written as f : X \rightarrow Y or X \xrightarrow{f} Y.

  • To define a function, a formula can be used, e.g., f(x) = x^2, x \in R, where the domain is specified.

  • Alternatively, the notation x \mapsto x^2 can be used, read as “x maps to x squared,” but the domain must be indicated separately.

  • For small finite sets, a table can define a function:

    Input 1 | 2 | 3 | 4
    Output 1/10 | 2/10 | 3/10 | 4/10
  • Functions can map any set to any set. For instance, one could have:

    • Input red | orange | yellow | green | blue
    • Output tomato | orange | lemon | lime | corn
  • It's crucial to be precise about the domain of a function.

  • For example, f(x) = \sqrt{x} is only properly defined for x \geq 0.

  • The exponential function is denoted as
    R \xrightarrow{\exp} (0, \infty)

    • with values \exp(x), also written as x \mapsto e^x.
  • The logarithmic function is denoted as
    (0, \infty) \xrightarrow{\log} R

    • with values \log(x).
  • These functions are inverses of each other:

    • \log(\exp(x)) = x for all x in the domain of \exp.
    • \exp(\log(x)) = x for all x in the domain of \log.
  • In this course, \log(x) always denotes the base e logarithm (natural logarithm).

  • Constant functions (e.g., x \mapsto c) and identity functions (e.g., x \mapsto x) are simple but important.

  • It is more correct to say that x \mapsto x^2 is a function than to say that x^2 is a function.

Probability Models

  • A probability model, also called a probability distribution, is a fundamental concept in probability theory.
  • Specifying a probability model can be done in several ways:
    • Probability mass function (PMF).
    • Probability density function (PDF).
    • Distribution function (DF).
    • Probability measure.
    • Expectation operator.
    • Function mapping from one probability model to another.

Probability Mass Functions (PMF)

  • A probability mass function (PMF) is a function f: S \rightarrow R where:
    • S is the sample space (nonempty set).
    • R is the set of real numbers.
    • f(x) \geq 0 for all x \in S (non-negativity).
    • \sum_{x \in S} f(x) = 1 (sums to one).
  • If the sample space is {x1, …, xn}, the PMF can be written as a function g satisfying g(x_i) \geq 0 for i = 1, …, n and \sum_{i=1}^{n} g(x_i) = 1.
  • The underlying concept of a PMF is more important than the specific notation used.
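The two defining properties can be checked mechanically. A minimal Python sketch (the dict representation and the name is_pmf are our own, purely illustrative):

```python
# Represent a PMF as a dict mapping outcomes to probabilities
# (an illustrative convention, not notation from the slides).

def is_pmf(f, tol=1e-12):
    """Check nonnegativity and that the probabilities sum to one."""
    return all(p >= 0 for p in f.values()) and abs(sum(f.values()) - 1) < tol

f = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}  # the table example above
print(is_pmf(f))                 # True: nonnegative and sums to one
print(is_pmf({1: 0.5, 2: 0.6}))  # False: sums to 1.1
```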

Interpretation of PMFs

  • An element of the sample space is called an outcome.
  • The value f(x) of the PMF at an outcome x is the probability of that outcome.
  • Probabilities are real numbers between 0 and 1, inclusive.
  • Probability 0 means "cannot happen" (or is ignored).
  • Probability 1 means "certain to happen" (or the possibility of it not happening is ignored).

Finite Probability Models

  • A probability model is finite if its sample space is a finite set.
  • Example: Smallest possible sample space S = {x}, with f(x) = 1.
  • Example: Next simplest sample space S = {x1, x2}, with f(x1) = p and f(x2) = 1 - p where 0 \leq p \leq 1.

Bernoulli Distribution

  • A probability distribution on the sample space {0, 1} is called a Bernoulli distribution.
  • If f(1) = p, then it is denoted as Ber(p).
  • A Bernoulli distribution can represent any two-point set by coding the points.

Statistical Models

  • A statistical model is a family of probability models.
  • The Bernoulli distribution often refers to the Bernoulli family of distributions, the set of all Ber(p) distributions for 0 \leq p \leq 1.
  • The PMF of the Ber(p) distribution can be defined by
    f_p(x) = \begin{cases} 1 - p, & x = 0 \\ p, & x = 1 \end{cases}
  • This family of PMFs is { f_p : 0 \leq p \leq 1 }.
  • x is the argument of the function f_p, while p is the parameter.
  • The set of allowed parameter values is called the parameter space.
  • For the Bernoulli statistical model, the parameter space is the interval [0, 1].
  • Example: Sample space with three points {x1, x2, x3}, with f(x1) = p1, f(x2) = p2, and f(x3) = 1 - p1 - p2.

Discrete Uniform Distribution

  • For a sample space {x1, …, xn}, the uniform distribution assigns equal probability to each outcome.
  • The PMF is defined as f(x_i) = \frac{1}{n} for i = 1, …, n.
  • Applications include coin flips and dice rolls.
  • Coin flip: Modeled by the uniform distribution on a two-point sample space.
  • Die roll: Modeled by a uniform distribution on a six-point sample space.
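The die example can be written down directly; a sketch using exact rational arithmetic (the helper name is our own):

```python
from fractions import Fraction

def uniform_pmf(sample_space):
    """Discrete uniform distribution: each outcome gets probability 1/n."""
    n = len(sample_space)
    return {x: Fraction(1, n) for x in sample_space}

die = uniform_pmf([1, 2, 3, 4, 5, 6])
print(die[3])             # 1/6
print(sum(die.values()))  # 1, so this is a PMF
```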

Supports

  • The support of a probability distribution is the set { x \in S : f(x) > 0 }, where S is the sample space and f is the PMF.
  • The distribution is concentrated on the support.
  • Points not in the support can be removed from the sample space without consequence.
  • In the Bernoulli family, all distributions have support {0, 1} except for
    • the distribution for p = 0, which is concentrated at 0, and
    • the distribution for p = 1, which is concentrated at 1.
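The support is straightforward to compute from a PMF; a sketch on the Bernoulli family (function names are our own):

```python
def ber(p):
    """PMF of the Ber(p) distribution on the sample space {0, 1}."""
    return {0: 1 - p, 1: p}

def support(f):
    """Set of outcomes with strictly positive probability."""
    return {x for x, prob in f.items() if prob > 0}

print(support(ber(0.5)))  # {0, 1}
print(support(ber(0.0)))  # concentrated at 0: {0}
print(support(ber(1.0)))  # concentrated at 1: {1}
```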

Events and Measures

  • A subset of the sample space is called an event.
  • If f is the PMF, the probability of an event A is defined by
    \Pr(A) = \sum_{x \in A} f(x)
  • By convention, a sum with no terms is zero, so \Pr(\emptyset) = 0.
  • This defines a probability measure \Pr that maps events to real numbers
    A \mapsto \Pr(A).
  • PMF and probability measures determine each other
    \Pr(A) = \sum_{x \in A} f(x), A \subset S
  • goes from PMF to measure, and
    f(x) = \Pr({x}), x \in S
  • goes from measure to PMF.
  • Note the distinction between the outcome x and the event {x}.
  • For any event A, we have \Pr(A) \geq 0 because all the terms in the sum in \Pr(A) = \sum_{x \in A} f(x) are nonnegative.
  • For any event A, we have \Pr(A) \leq 1 because \Pr(A) = 1 - \sum_{x \in S \backslash A} f(x) and all the terms in that sum are nonnegative.
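The map from PMF to probability measure is a one-line sum; a sketch (names are our own):

```python
def pr(A, f):
    """Probability measure induced by the PMF f: Pr(A) = sum of f(x) over x in A."""
    return sum(f[x] for x in A)

f = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}
print(pr({2, 4}, f))  # 0.6 up to floating-point rounding
print(pr(set(), f))   # 0, the empty-sum convention
```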

Random Variables and Expectation

  • A real-valued function on the sample space is called a random variable.
  • If f is the PMF, then the expectation of a random variable X is defined by
    E(X) = \sum_{s \in S} X(s)f(s)
  • This defines an expectation operator E that maps random variables to real numbers
    X \mapsto E(X).
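A sketch of the expectation operator on a small model; representing the random variable as a Python function mirrors the definition (names are our own):

```python
def expectation(X, f):
    """E(X) = sum over the sample space of X(s) f(s)."""
    return sum(X(s) * f[s] for s in f)

f = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}  # uniform on four points
print(expectation(lambda s: s, f))       # 2.5, the mean of 1, 2, 3, 4
print(expectation(lambda s: s ** 2, f))  # 7.5 = (1 + 4 + 9 + 16) / 4
```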

Sets Again: Cartesian Product

  • The Cartesian product of sets A and B, denoted A \times B, is the set of all pairs of elements
    A \times B = { (x, y) : x \in A \text{ and } y \in B }
  • We write the Cartesian product of A with itself as A^2.
  • In particular, R^2 is the space of two-dimensional vectors or points in two-dimensional space.
  • Similarly for triples
    A \times B \times C = { (x, y, z) : x \in A \text{ and } y \in B \text{ and } z \in C }
  • We write A \times A \times A = A^3.
  • In particular, R^3 is the space of three-dimensional vectors or points in three-dimensional space.
  • Similarly for n-tuples
    A_1 \times A_2 \times \cdots \times A_n = { (x_1, x_2, …, x_n) : x_i \in A_i, i = 1, …, n }
  • We write A \times A \times \cdots \times A = A^n when there are n sets in the product.
  • In particular, R^n is the space of n-dimensional vectors or points in n-dimensional space.
  • Any function of random variables is a random variable.

Averages and Weighted Averages

  • The average of the numbers x1, …, xn is
    \frac{1}{n} \sum_{i=1}^{n} x_i
  • The weighted average of the numbers x1, …, xn with the weights w1, …, wn is
    \sum_{i=1}^{n} w_i x_i
  • The weights in a weighted average are required to be nonnegative and sum to one.
  • Expectation and weighted averages are the same concept in different language and notation.
  • In expectation we sum
    values of random variable · probabilities
  • in weighted averages we sum
    arbitrary numbers · weights
  • but weights are just like probabilities (nonnegative and sum to one) and the values of a random variable can be defined arbitrarily (whatever we please) and are numbers.
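The identity between weighted averages and expectations is literally the same sum; a small numeric check (the numbers are our own):

```python
xs = [10.0, 20.0, 40.0]  # arbitrary numbers / values of a random variable
ws = [0.5, 0.25, 0.25]   # weights: nonnegative and sum to one, like probabilities

weighted_average = sum(w * x for w, x in zip(ws, xs))
expectation_sum = sum(x * p for x, p in zip(xs, ws))  # same sum, probability language
print(weighted_average)                     # 20.0
print(weighted_average == expectation_sum)  # True: identical term by term
```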

Random Variables and Expectation (cont.)

When using f for the PMF, S for the sample space, and x for points of S, if S \subset R, then we often use X for the identity random variable x \mapsto x, so that
\begin{aligned}
E(X) &= \sum_{x \in S} x f(x) \\
E\{g(X)\} &= \sum_{x \in S} g(x) f(x)
\end{aligned}

Probability of Events and Random Variables

  • Suppose we are interested in \Pr(A), where A is an event involving a random variable
    A = { s \in S : 4 < X(s) < 6 }
  • A convenient shorthand for this is \Pr(4 < X < 6).
  • The explicit subset A of the sample space the event consists of is not mentioned.
  • Nor is the sample space S explicitly mentioned.
  • Since X is a function S \rightarrow R, the sample space is implicitly mentioned.

Sets Again: Set Difference

  • The difference of sets A and B, denoted A \ B, is the set of all points of A that are not in B
    A \backslash B = { x \in A : x \notin B }

Functions Again: Indicator Functions

  • If A \subset S, the function S \rightarrow R defined by
    I_A(x) = \begin{cases}
    0, & x \in S \backslash A \\
    1, & x \in A
    \end{cases}
  • is called the indicator function of the set A.
  • If S is the sample space of a probability model, then I_A : S \rightarrow R is a random variable.

Indicator Random Variables

  • Any indicator function I_A on the sample space is a random variable.
  • Conversely, any random variable X that takes only the values zero or one (we say zero-or-one-valued) is an indicator function.
  • Define
    A = { s \in S : X(s) = 1 }
  • Then
    X = I_A

Probability is a Special Case of Expectation

  • If \Pr is the probability measure and E the expectation operator of a probability model, then
    \Pr(A) = E(I_A), \text{ for any event A}
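This identity can be checked numerically on a small model; a sketch (the model is our own):

```python
f = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}  # a PMF on a four-point sample space
A = {2, 3}                            # an event

pr_A = sum(f[x] for x in A)                          # Pr(A) from the measure
e_ind = sum((1 if s in A else 0) * f[s] for s in f)  # E(I_A) from expectation
print(abs(pr_A - e_ind) < 1e-12)  # True: Pr(A) = E(I_A)
```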

Philosophy

  • Philosophers and philosophically inclined mathematicians and scientists have spent centuries trying to say exactly what probability and expectation are.
  • This project has been a success in that it has piled up an enormous literature.
  • It has not generated agreement about the nature of probability and expectation.
  • If you ask two philosophers what probability and expectation are, you will get three or four conflicting opinions.

Frequentism

  • The frequentist theory of probability and expectation holds that they are objective facts about the world.
  • Probabilities and expectations can actually be measured in an infinite sequence of repetitions of a random phenomenon, if each repetition has no influence whatsoever on any other repetition.
  • Let X1, X2, … be such an infinite sequence of random variables and for each n define
    \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i
  • then \bar{X}_n gets closer and closer to E(X_i) — which is assumed to be the same for all i because each X_i is the “same” random phenomenon — as n goes to infinity.
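The frequentist idea cannot be carried out infinitely, but a simulation illustrates it: running averages of IID Ber(0.3) draws settle near E(X_i) = 0.3 (the sample size and seed here are our own choices):

```python
import random

random.seed(0)  # fixed seed so the run is reproducible
n = 100_000
draws = [1 if random.random() < 0.3 else 0 for _ in range(n)]  # IID Ber(0.3)
xbar = sum(draws) / n
print(abs(xbar - 0.3) < 0.02)  # True: the running average is near 0.3
```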

Subjectivism

  • The subjectivist theory of probability and expectation holds that they are all in our heads, a mere reflection of our uncertainty about what will happen or has happened.
  • Consequently, subjectivism is personalistic.
  • You have your probabilities, which reflect or “measure” your uncertainties.
  • I have mine.
  • There is no reason we should agree, unless our information about the world is identical, which it never is.
  • Hiding probabilities and expectations inside the human mind, which is incompletely understood, avoids the troubles of frequentism, but it makes it hard to motivate any properties of such a hidden, perhaps mythical, thing.

Formalism

  • The mainstream philosophy of all of mathematics — not just probability theory — of the twentieth century and the twenty-first, what there is of it so far, is formalism.
    Mathematics may be defined as the subject in which we never know what we are talking about, nor whether what we are saying is true — Bertrand Russell
  • Formalists only care about the form of arguments, that theorems have correct proofs, conclusions following from hypotheses and definitions by logically correct arguments.
  • It does not matter what the hypotheses and definitions “really” mean (“we never know what we are talking about”) nor whether they are “really” true (“nor whether what we are saying is true”).

Everyday Philosophy

How statisticians really think about probability and expectation.

  • You’ve got two kinds of variables:
    • random variables are denoted by capital letters like X and
    • ordinary variables are denoted by lower case letters like x.
  • A random variable X doesn’t have a value yet, because you haven’t seen the results of the random process that generates it. After you have seen it, it is either a number or an ordinary variable x standing for whatever number it is.

Change of Variable

  • Suppose f_X is the PMF of a random variable X having sample space S, and Y = g(X) is another random variable.
  • If we want to consider Y as the “original” random variable rather than X, then we need to determine its PMF f_Y.
  • This is a function on the codomain of g, call that T, given by
    f_Y(y) = \Pr(Y = y), y \in T.
  • and
    \Pr(Y = y) = \Pr\{g(X) = y\} = \sum_{x \in S : g(x) = y} f_X(x)
  • Thus we have derived the change-of-variable formula for discrete probability distributions.
    f_Y(y) = \sum_{x \in S : g(x) = y} f_X(x), \quad y \in T. \quad (*)
  • The probability distribution with PMF f_Y is sometimes called the image distribution of the distribution with PMF f_X because its support is the image of the support of X under the function g
    g(S) = { g(x) : x \in S }
    (if S is the support of X). But (*) works even if S is larger than the support of X.
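The change-of-variable formula translates directly into code: push the PMF of X forward through g by accumulating f_X(x) at g(x). A sketch (names are our own):

```python
from collections import defaultdict

def image_pmf(f_X, g):
    """Change of variable: f_Y(y) = sum of f_X(x) over x with g(x) = y."""
    f_Y = defaultdict(float)
    for x, p in f_X.items():
        f_Y[g(x)] += p
    return dict(f_Y)

f_X = {-2: 0.25, -1: 0.25, 1: 0.25, 2: 0.25}  # uniform on {-2, -1, 1, 2}
f_Y = image_pmf(f_X, lambda x: x ** 2)        # Y = X^2
print(f_Y)  # {4: 0.5, 1: 0.5}: two x values map to each y
```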

The PMF of a Random Variable

  • A random variable is a function on the sample space. Hence it induces an image distribution by the change-of-variable formula.
  • We say two random variables X and Y having different probability models (possibly different sample spaces and different PMF’s) are equal in distribution or have the same distribution if they have the same image distribution.
  • If probability theory is to make sense, it had better be true that if Y = g(X) and f_X and f_Y are the PMF’s of X and Y, then
    E(Y) = \sum_{y \in T} y f_Y(y) = E\{g(X)\} = \sum_{x \in S} g(x) f_X(x)

The PMF of a Random Vector

  • For any random variable X taking values in a finite subset S of R and any random variable Y taking values in a finite subset T of R define
    f(x, y) = \Pr(X = x \text{ and } Y = y), \quad (x, y) \in S \times T.
  • By the change-of-variable formula, f : S \times T \rightarrow R is the PMF of the two-dimensional random vector (X, Y ).
  • For any random variables X1, X2, …, Xn taking values in finite subsets S1, S2, …, Sn of R, respectively, define
    f(x_1, x_2, …, x_n) = \Pr(X_i = x_i, i = 1, …, n), \quad (x_1, x_2, …, x_n) \in S_1 \times S_2 \times \cdots \times S_n.
  • By the change-of-variable formula, f : S_1 \times S_2 \times \cdots \times S_n \rightarrow R is the PMF of the n-dimensional random vector (X1, X2, …, Xn).

Independence

  • Independence, sometimes called statistical independence or stochastic independence for emphasis, is the only notion of independence used in probability theory, so the adjectives are redundant.
  • Random variables X1, …, Xn are independent if the PMF f of the random vector (X1, …, Xn) is the product of the PMF’s of the component random variables
    f(x_1, …, x_n) = \prod_{i=1}^{n} f_i(x_i), \quad (x_1, …, x_n) \in S_1 \times \cdots \times S_n
  • where
    f_i(x_i) = \Pr(X_i = x_i), \quad x_i \in S_i
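Independence of two random variables can be checked by comparing the joint PMF with the product of its marginals; a sketch for the two-variable case (names and examples are our own):

```python
from itertools import product

def is_independent(joint, tol=1e-12):
    """Check whether a joint PMF on S x T factors into its marginals."""
    xs = sorted({x for x, _ in joint})
    ys = sorted({y for _, y in joint})
    f1 = {x: sum(joint[(x, y)] for y in ys) for x in xs}  # marginal PMF of X
    f2 = {y: sum(joint[(x, y)] for x in xs) for y in ys}  # marginal PMF of Y
    return all(abs(joint[(x, y)] - f1[x] * f2[y]) < tol
               for x, y in product(xs, ys))

two_coins = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}        # independent
same_coin = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}  # dependent
print(is_independent(two_coins))  # True
print(is_independent(same_coin))  # False
```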

Counting

  • How many ways are there to arrange n distinct things?
  • You have n choices for the first. After the first is chosen, you have n − 1 choices for the second. After the second is chosen, you have n − 2 choices for the third.
  • There are
    n! = n(n − 1)(n − 2) · · · 3 · 2 · 1
    arrangements, which is read “n factorial”.
  • n factorial can also be written
    n! = \prod_{i=1}^{n} i
    By definition 0! = 1. There is one way to order zero things.
  • How many ways are there to arrange k things chosen from n distinct things?
    You have n choices for the first. After the first is chosen, you have n − 1 choices for the second. After the second is chosen, you have n − 2 choices for the third.
    You stop when you have made k choices. There are
    (n)_k = n(n − 1)(n − 2) · · · (n − k + 1)
    arrangements, which is read “the number of permutations of n things taken k at a time”.
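Both counting formulas are one-liners; a sketch (the falling-factorial helper name is our own):

```python
import math

def falling_factorial(n, k):
    """(n)_k = n (n - 1) ... (n - k + 1): permutations of n things taken k at a time."""
    result = 1
    for i in range(k):
        result *= n - i
    return result

print(math.factorial(4))        # 24 ways to arrange 4 distinct things
print(falling_factorial(5, 2))  # 20 = 5 * 4
print(falling_factorial(5, 0))  # 1, the empty product (compare 0! = 1)
print(falling_factorial(5, 5) == math.factorial(5))  # True: (n)_n = n!
```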

The Binomial Distribution

  • Let X1, …, Xn be independent and identically distributed Bernoulli random variables.
    Identically distributed means they all have the same parameter value: they are all Ber(p) with the same p.
    Define
    Y = X1 + … + Xn.
    The distribution of Y is called the binomial distribution for sample size n and success probability p, indicated Bin(n, p) for short.

Binomial Distribution (cont.)

The event Y = x occurs when exactly x of the Xi are equal to one; there are \binom{n}{x} ways to choose which ones, and by independence each such outcome has probability p^x (1 − p)^{n−x}. Hence the binomial distribution has PMF
f(x) = \binom{n}{x} p^x (1 − p)^{n−x}, \quad x = 0, 1, …, n
The sample space is {0, 1, …, n} and the parameter space is [0, 1] just like for the Bernoulli distribution.
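A sketch of the Bin(n, p) PMF, with a check that it sums to one over the sample space (the function name is our own):

```python
from math import comb

def binom_pmf(x, n, p):
    """PMF of the Bin(n, p) distribution at x in {0, 1, ..., n}."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 5, 0.3
print([round(binom_pmf(x, n, p), 4) for x in range(n + 1)])
print(abs(sum(binom_pmf(x, n, p) for x in range(n + 1)) - 1) < 1e-12)  # sums to one
```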

Addition Rules

  • We have now met another “brand name” distribution Bin(n, p).
  • We have also met our first “addition rule”.
  • If X1, …, Xn are independent and identically distributed (IID) Ber(p) random variables, then Y = X1 + \cdots + Xn is a Bin(n, p) random variable.