Dataset
A structured collection of data points, usually organised in tabular form. Each column represents an attribute, and each row holds one value for each attribute, with the values in a row linked in some way (e.g. belonging to the same subject).
Types of Data Attributes
Numerical (Quantitative) - can be measured/counted, e.g. height
Discrete - Takes a finite/countable number of values, e.g. number of children.
Continuous - Can (theoretically) take any value in an interval, e.g. height.
Categorical (Qualitative) - can’t be measured/counted, e.g. gender
Ordinal - Has a natural defined order, e.g. rankings
Nominal - Has no natural order, e.g. colours
Frequency Table
A table that displays the number of occurrences (frequency) of each category in a dataset.
Grouping Continuous Data
Split values into intervals (bins), calculate frequency within each bin, depict with a histogram.
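A minimal sketch (not from the notes; numpy assumed, the heights are made-up values) of binning continuous values and counting the frequency in each bin:

```python
# Sketch: bin continuous values into intervals and count frequencies.
import numpy as np

heights = np.array([1.62, 1.75, 1.81, 1.68, 1.90, 1.73, 1.77, 1.65])

# Split the range into 4 equal-width bins and count the frequency in each.
counts, bin_edges = np.histogram(heights, bins=4)
for lo, hi, c in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"[{lo:.2f}, {hi:.2f}): {c}")
```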
Mean
x̄ = (1/n) ∑ xᵢ
For real numbers k and a:
mean(x₁ + k, x₂ + k, …) = mean(x) + k
mean(ax₁, ax₂, …) = a · mean(x)
Median
The middle value of a sorted dataset (or the average of the 2 middle values when the number of values is even).
For real numbers k and a:
median(x₁ + k, x₂ + k, …) = median(x) + k
median(ax₁, ax₂, …) = a · median(x)
Outlier
A value in a dataset which is vastly different to the majority of data.
Variance
Average of the squared differences from the mean.
Var(x₁, x₂, …, xₙ) = (1/n) ∑(xᵢ - x̄)²
Alternatively, expressed as:
Var(x₁, x₂, …, xₙ) = mean(x₁², …, xₙ²) - x̄², aka 'the mean of the squares minus the square of the mean' (MSMSM).
For real numbers k and a:
var(x₁ + k, x₂ + k, …) = var(x)
var(ax₁, ax₂, …) = a² · var(x)
Standard Deviation
Amount of dispersion or spread in a dataset.
sd(x1, x2, …, xn) = sqrt(Var(x1, x2, …, xn))
For real numbers k and a:
sd(x₁ + k, x₂ + k, …) = sd(x)
sd(ax₁, ax₂, …) = |a| · sd(x)
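A minimal sketch (not from the notes; numpy assumed, the values are made up) checking the shift and scale rules for mean, variance and standard deviation numerically:

```python
# Sketch: verify the shift/scale rules (np.var/np.std use the divide-by-n convention, as above).
import numpy as np

x = np.array([2.0, 5.0, 7.0, 11.0])
k, a = 3.0, -2.0

assert np.isclose(np.mean(x + k), np.mean(x) + k)      # mean(x + k) = mean(x) + k
assert np.isclose(np.mean(a * x), a * np.mean(x))      # mean(ax)    = a * mean(x)
assert np.isclose(np.var(x + k), np.var(x))            # var(x + k)  = var(x)
assert np.isclose(np.var(a * x), a**2 * np.var(x))     # var(ax)     = a^2 * var(x)
assert np.isclose(np.std(a * x), abs(a) * np.std(x))   # sd(ax)      = |a| * sd(x)
```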
Upper/Lower Quartiles
Lower Quartile - Median of lower half of a dataset
Upper Quartile - Median of upper half of a dataset
i.e. for a dataset of 2n values, the LQ and UQ are each the median of n values; for 2n+1 values, the middle value is excluded from both halves.
Interquartile Range
Measure of dispersion less sensitive to outliers.
IQR = UQ - LQ, where:
UQ is upper quartile, median of top half
LQ is lower quartile, median of bottom half
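A minimal sketch (not from the notes; numpy assumed, the values are made up) of computing quartiles and the IQR. Note that numpy's default percentile method interpolates, so it may differ slightly from the 'median of each half' convention described above:

```python
# Sketch: lower/upper quartiles and interquartile range.
import numpy as np

x = np.array([1, 3, 4, 7, 8, 9, 12, 15])
lq, uq = np.percentile(x, [25, 75])
iqr = uq - lq
print(lq, uq, iqr)
```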
Sample Space (Ω)
The set of all possible outcomes of a random experiment.
Event
A subset of the sample space, i.e. a set of one or more outcomes from Ω.
The set of all events is denoted by F. If the result of the experiment is in E, then E is said to have 'occurred'.
Mutually Exclusive
Events that cannot occur simultaneously.
Ei ∩ Ej = ∅
For more than 2 events to all be mutually exclusive, this must hold for every distinct pair i and j.
Independent Events
One event doesn’t affect the outcome of another.
P(A ∩ B) = P(A)P(B)
Discrete Random Variable
A function defined on an experiment's sample space. If X: Ω → ℝ, X is a random variable.
If X's range is a finite or countable set, e.g. the integers, then X is a discrete random variable.
If g: ℝ → ℝ, then g(X) is also a discrete random variable, mapping each outcome ω to g(X(ω)).
P(X = x)
Shorthand for P({ω : X(ω) = x}), i.e. the probability that the random variable takes the specific value x.
Support (of discrete random variable)
The set of values of x for which P(X = x) > 0.
Probability Mass Function f_X(x)
f_X(x) = P(X = x) for all values x in the support. Plotted, it creates a 'bar chart'.
Cumulative Distribution Function F_X(x)
F_X(x) = P(X ≤ x).
For a discrete random variable X, F_X is a step function: it is constant between the values in the support and jumps up at each of them, creating the 'staircase'/'arrow' graph.
Continuous Random Variable
A random variable that can take any value within a range. A variable X: Ω → ℝ with the property that there is a density function f_X such that:
For all a and b with a ≤ b:
P(a ≤ X ≤ b) = ∫_a^b f_X(u) du
Density Function
A function f_X(u) that describes the probability distribution of a continuous random variable X. It is used to determine the probability of X falling within a certain range by integrating over that range.
f_X(x) ≥ 0 for all x ∈ ℝ
∫_-∞^∞ f_X(u) du = 1
P(X = x) = 0 for all x ∈ ℝ, i.e. the probability of any exact point is 0 (since integrating from a to a gives 0 area).
Uniform Distribution
A probability distribution where all outcomes are equally likely within a specified range.
For a continuous uniform distribution, this has density function:
f_X(x) = 1/(b - a) for a ≤ x ≤ b, 0 otherwise
So for a uniform distribution on [0, 1], this is f_X(x) = 1 for 0 ≤ x ≤ 1.
Cumulative Distribution Function
A function that describes the probability that a continuous random variable X takes a value ≤ a given constant.
F_X(x) = P(X ≤ x) = ∫_-∞^x f_X(u) du
e.g.,
P(a ≤ X ≤ b) = P(X ≤ b) - P(X ≤ a) = F_X(b) - F_X(a)
lim x→∞ F_X(x) = 1
lim x→-∞ F_X(x) = 0
F_X is non-decreasing
f_X = F_X′ (the density is the derivative of the CDF)
Exponential Distribution
A probability distribution that models the time until an event occurs.
X has the exponential distribution with parameter λ > 0 if its density function is:
f_X(x) = λe^(-λx) for x ≥ 0, 0 otherwise
and therefore its cumulative distribution function is:
F_X(x) = 1 - e^(-λx) for x ≥ 0, 0 otherwise
X is the time until something happens, e.g. P(X < 1) is the probability a TV fails within 1 year.
Normal Distribution
A probability distribution that is symmetric around the mean, μ.
A continuous random variable X has the normal distribution, i.e. X ~ N(μ, σ²), where σ² is the variance, if its density function is:
fx(x) = (1 / (σ√(2π))) * e^(-(x - μ)² / (2σ²))
When μ = 0 and σ = 1, it’s the Standard Normal Distribution.
Standard Normal Distribution
A form of the Normal Distribution where μ = 0 and σ = 1. A standard normal random variable is usually denoted by Z, i.e. Z ~ N(0, 1).
Its cumulative distribution function, F_Z(x), is denoted by Φ:
Φ(x) = P(Z ≤ x)
Functions of Random Variables Method
Since a function of a random variable (i.e. Y = g(X) for g: ℝ → ℝ) is also a random variable, this method is used to derive the distribution of the new variable.
Write the new variable's cumulative distribution function, F_Y(y) = P(Y ≤ y)
Rearrange the event inside the probability so it is expressed in terms of the known variable's CDF, F_X
Solve for the density function by differentiating
E.g.
f_X(x) = 2x on [0, 1], F_X(x) = x², Y = g(X) = X²
F_Y(y) = P(Y ≤ y) = P(X² ≤ y) = P(X ≤ √y) = F_X(√y) = y
f_Y(y) = 1 (differentiate), i.e. Y is uniform on [0, 1]
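A minimal simulation sketch (not from the notes; numpy assumed) checking the example above: sampling X with F_X(x) = x² and confirming Y = X² behaves like a Uniform(0, 1) variable.

```python
# Sketch: X has F_X(x) = x^2 on [0, 1]; check that Y = X^2 looks Uniform(0, 1).
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)
x = np.sqrt(u)            # inverse-transform sample, since F_X(x) = x^2
y = x**2                  # the transformed variable

# The empirical CDF of Y at a few points should be close to F_Y(y) = y.
for t in [0.1, 0.5, 0.9]:
    print(t, np.mean(y <= t))
```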
X ~ N(μ, σ²) and Y = aX + b, a ≠ 0.
Y ~ N(aμ + b, (aσ)²)
This result allows us to use Φ (the CDF of the Standard Normal Distribution) on any normally distributed random variable, because if we take a = 1/σ and b = -μ/σ, we get:
Y = X/σ - μ/σ = (X - μ)/σ
so Y ~ N(0, 1), i.e. Y has the standard normal distribution.
e.g.
X ~ N(5, 10)
Let Y = (X - 5)/√10. Now, Y ~ N(0, 1), and we can use the Standard Normal Distribution to find probabilities related to X through Z.
P(X ≤ 2) = P(Y ≤ (2 - 5)/√10) = Φ(-0.9487)
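A minimal sketch (not from the notes; scipy assumed) of the same standardisation, using scipy's standard normal CDF for Φ:

```python
# Sketch: P(X <= 2) for X ~ N(5, 10), via standardisation and the standard normal CDF.
import numpy as np
from scipy.stats import norm

mu, var = 5, 10
z = (2 - mu) / np.sqrt(var)   # about -0.9487
print(norm.cdf(z))            # P(X <= 2), roughly 0.171
```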
Generating random variables by converting from Uniform
If a special distribution isn’t available, we can convert the uniform distribution’s ‘amount of probability’ into a variable for the new distribution.
F is the cumulative distribution function of the unknown variable.
Take a uniform random variable U on [0, 1] and let X = F⁻¹(U). The inverse function takes the uniform 'probability level' and maps it to the point on the axis that has that cumulative probability under F.
P(X ≤ x) = P(F⁻¹(U) ≤ x) (by the definition of X)
= P(U ≤ F(x)) (F⁻¹(U) ≤ x exactly when U is at or below the CDF height at x)
= F(x) (because for a uniform random variable on [0, 1], P(U ≤ u) = u, e.g. P(U ≤ 0.5) = 0.5)
So P(X ≤ x) = F(x), i.e. X has CDF F.
In practice:
Invert the target cumulative distribution function to get F⁻¹
Generate a random number u from the uniform distribution on [0, 1]
Substitute it into the inverse: x = F⁻¹(u) is a sample from the target distribution
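A minimal sketch (not from the notes; numpy assumed) of these steps for the Exponential(λ) distribution, where F(x) = 1 - e^(-λx) and so F⁻¹(u) = -ln(1 - u)/λ:

```python
# Sketch: inverse-transform sampling from Exponential(lambda).
import numpy as np

rng = np.random.default_rng(1)
lam = 2.0
u = rng.uniform(size=100_000)     # step 2: uniform random numbers on [0, 1]
x = -np.log(1 - u) / lam          # step 3: substitute into the inverse CDF

print(x.mean())                   # should be close to 1/lambda = 0.5
```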
Jointly Distributed
Random variables defined on the same sample space, with a joint probability distribution defining their behaviour.
Joint Probability/Bivariate Mass Function
A probability mass function for jointly distributed discrete random variables.
f_X,Y(x, y) = P(X = x, Y = y)
Marginal Mass Function
The probability distribution of a single variable from a jointly-distributed pair of random variables.
e.g. f_X(x) = ∑_y f_X,Y(x, y) = P(X = x)
Joint Probability/Bivariate Density Function
A probability density function for jointly distributed continuous random variables.
A function f_X,Y(x, y): ℝ × ℝ → ℝ such that:
P(a ≤ X ≤ b and c ≤ Y ≤ d) = ∫_a^b ∫_c^d f_X,Y(x, y) dy dx
f_X,Y(x, y) ≥ 0 for all x, y ∈ ℝ
∫_-∞^∞ ∫_-∞^∞ f_X,Y(x, y) dy dx = 1
Marginal Density Function
The probability density function of a single variable from a jointly-distributed pair of random variables.
e.g. f_X(x) = ∫_-∞^∞ f_X,Y(x, y) dy
Independent Random Variables
X and Y are independent if, for every pair of sets A and B:
P(X ∈ A and Y ∈ B) = P(X ∈ A) · P(Y ∈ B)
This means that:
f_X,Y(x, y) = f_X(x) · f_Y(y)
For jointly-distributed random variables, if X and Y are independent and f_Y(y) > 0, then:
f_X|Y(x | y) = f_X(x)
Determining Probability of Joint Density Functions
To find the probability that (X, Y) lies in a region R,
P((X, Y) ∈ R) = ∫∫_R f_X,Y(x, y) dy dx
The integral of the joint density function over the region represents a volume. (X, Y) lying in the region R corresponds to the event occurring.
Example:
If R is the rectangle a ≤ x ≤ b, c ≤ y ≤ d, then P((X, Y) ∈ R) = ∫_a^b ∫_c^d f_X,Y(x, y) dy dx.
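A minimal sketch (not from the notes; scipy assumed, and f_X,Y(x, y) = x + y on [0, 1]² is just an assumed example density) of integrating a joint density over a rectangle numerically:

```python
# Sketch: P(0 <= X <= 0.5, 0 <= Y <= 0.5) for the example density f(x, y) = x + y on [0, 1]^2.
from scipy.integrate import dblquad

f = lambda y, x: x + y   # dblquad expects func(y, x)

total, _ = dblquad(f, 0, 1, lambda x: 0, lambda x: 1)      # integrates to 1: a valid density
prob, _ = dblquad(f, 0, 0.5, lambda x: 0, lambda x: 0.5)   # probability over the rectangle
print(total, prob)                                         # 1.0, 0.125
```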
Conditional Probability
The probability of one event restricted to the sample space of another.
P(A | B) = P(A ∩ B) / P(B)
Jointly Distributed Conditional Probability
Using the equation for conditional probability:
If the variables are discrete:
f_X|Y(x | y) = P(X = x | Y = y) = f_X,Y(x, y) / f_Y(y)
f_X|Y(x | y) ≥ 0 for all x
∑_x f_X|Y(x | y) = 1 (since the sample space is now restricted to Y = y)
If the variables are continuous:
f_X|Y(x | y) = f_X,Y(x, y) / f_Y(y)
f_X|Y(x | y) ≥ 0 for all x
∫_-∞^∞ f_X|Y(x | y) dx = 1
Bayes Theorem
P(A | B) = P(B|A)P(A) / P(B)
Expectation of Discrete Random Variable
E(X) = ∑_x x · f_X(x) = ∑_x x · P(X = x)
Expectation of Continuous Random Variable
E(X) = ∫_-∞^∞ u f_X(u) du
Law of the Unconscious Statistician
If X is a random variable and g is a function and Y = g(X):
If discrete:
E(Y) = ∑_x g(x) · f_X(x) = ∑_x g(x) · P(X = x)
If continuous:
E(Y) = ∫_-∞^∞ g(u) f_X(u) du
For multiple variables (jointly distributed):
If W = g(X, Y), then E(W) = ∫_-∞^∞ ∫_-∞^∞ g(x, y) f_X,Y(x, y) dx dy
Properties of Expectation of Random Variables
E(a₀ + a₁X₁ + … + aₙXₙ) = a₀ + a₁E(X₁) + … + aₙE(Xₙ)
i.e. E(X + Y) = E(X) + E(Y)
If X and Y are jointly-distributed and independent,
E(XY) = E(X)E(Y)
Variance of Random Variables
Var(X) = E[(X - E(X))²] = E(X²) - (E(X))², aka ESMSE:
the expectation of the square minus the square of the expectation.
For jointly-distributed, independent random variables X₁, …, Xₙ:
Var(a₀ + a₁X₁ + … + aₙXₙ) = a₁²Var(X₁) + … + aₙ²Var(Xₙ)
Distribution of Sums of Normally-distributed independent variables
If X and Y are both independent and normally distributed, i.e.:
X ~ N(μ₁, σ₁²) and Y ~ N(μ₂, σ₂²)
aX + bY ~ N(aμ₁ + bμ₂, a²σ₁² + b²σ₂²)
i.e. linear combinations of independent normal variables are also normally distributed.
Stochastically Dominates
If X and Y are random variables defined on the same sample space and X(ω) ≥ Y(ω) for every outcome ω, then X stochastically dominates Y.
E.g. X represents the number of heads in 3 coin tosses, while Y represents the number of heads only in the final toss.
Then E(X) ≥ E(Y).
Markov's Inequality
If X is a random variable and a > 0, then:
P(|X| ≥ a) ≤ E(|X|)/a
Markov’s inequality tells us that if the expectation of |X| is not large, then the probability that |X| is large is small.
Chebyshev’s Inequality
If X is a random variable and a > 0, then:
P(|X - E(X)| ≥ a) ≤ Var(X)/a²
Chebyshev's Inequality gives an upper bound on the probability that a variable is far away from its mean.
If we take a = kσ, this becomes:
P(|X - E(X)| ≥ kσ) ≤ 1/k².
The Weak Law of Large Numbers (Theorem 4.4.1)
If X₁, X₂, … are independent random variables with the same distribution, with E(Xᵢ) = μ and Var(Xᵢ) = σ², and Sₙ = X₁ + … + Xₙ, then for every ε > 0:
P(|Sₙ/n - μ| ≥ ε) ≤ σ²/(nε²)
so P(|Sₙ/n - μ| ≥ ε) → 0 as n → ∞.
Central Limit Theorem (Theorem 4.4.2)
If X₁, X₂, … are independent random variables with the same distribution, with E(Xᵢ) = μ and Var(Xᵢ) = σ² > 0,
and Sₙ = X₁ + X₂ + … + Xₙ and Zₙ = (Sₙ - nμ)/(σ√n), then:
lim n→∞ F_Zₙ(z) = Φ(z), and
E(Zₙ) = 0, Var(Zₙ) = 1
Using the Central Limit Theorem for large numbers of variables
In questions, if given a large number of independent random variables, e.g. a batch of components, approximate their sum (or mean) with a normal distribution.
Sₙ/n = X̄ ≈ N(μ, σ²/n), where μ and σ² are the mean and variance of each Xᵢ.
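A minimal simulation sketch (not from the notes; numpy/scipy assumed, the Exponential(1) choice is arbitrary) of the sample-mean approximation above:

```python
# Sketch: sample mean of n Exponential(1) variables (mean 1, variance 1) vs N(1, 1/n).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n, reps = 50, 100_000
means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

# Simulated P(sample mean <= 1.1) vs the CLT approximation.
print(np.mean(means <= 1.1))
print(norm.cdf(1.1, loc=1.0, scale=np.sqrt(1.0 / n)))
```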
Using the Central Limit Theorem for binomial and large n
Decompose the binomial into a sum of n independent Bernoulli(p) variables.
E(Xᵢ) = p and Var(Xᵢ) = p(1 - p), so for the sum, μ = np and σ² = np(1 - p).
Using the Central Limit Theorem, ∑Xᵢ ≈ N(np, np(1 - p)).
Continuity correction (since X is discrete): P(X = k) ≈ P(k - 0.5 ≤ X ≤ k + 0.5)
So for exam questions: identify the sum, compute its mean and variance (np and np(1 - p)), apply the continuity correction, standardise with (X - μ)/σ, and read probabilities off the standard normal.
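A minimal sketch (not from the notes; scipy assumed, and n, p, k are arbitrary example values) comparing the continuity-corrected normal approximation with the exact binomial probability:

```python
# Sketch: normal approximation to P(X = k) for X ~ Binomial(n, p), with continuity correction.
from math import sqrt
from scipy.stats import binom, norm

n, p, k = 100, 0.3, 28
mu, sigma = n * p, sqrt(n * p * (1 - p))

approx = norm.cdf((k + 0.5 - mu) / sigma) - norm.cdf((k - 0.5 - mu) / sigma)
exact = binom.pmf(k, n, p)
print(approx, exact)        # both roughly 0.08
```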
Covariance
A measure of how 2 variables change together.
Cov(X, Y) = E[(X - E(X))(Y - E(Y))]
If they are jointly distributed, then
Cov(X, Y) = E(XY) - E(X)E(Y)
If they are independent, then
Cov(X, Y) = 0, since E(XY) would equal E(X)E(Y)
Var(X + Y) for jointly-distributed variables
Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)
Pearson correlation coefficient
A normalised measure of the correlation between 2 variables.
ρ(X, Y) = Cov(X, Y) / √(Var(X)Var(Y))
If X and Y are independent, ρ(X, Y) = 0.
ρ = 1 iff P(Y = a + bX) = 1 for some b > 0
ρ = -1 iff P(Y = a - bX) = 1 for some b > 0
Sample correlation coefficient
For a given dataset (sample):
r_x,y = (mean(xy) - mean(x)·mean(y)) / (sd(x)·sd(y))
Linear Regression Model
With Y being the dependent variable and X being the independent variable, and n data points (Xᵢ, Yᵢ), the simple linear regression model describes the relationship between the attributes by the equation:
Yᵢ = α + βXᵢ + εᵢ
where εᵢ is the error term: independent, normally distributed with mean 0 and unknown variance.
α and β are found by minimising the sum of squared residuals, where a residual is the measured minus the predicted value, yᵢ - (α + βxᵢ). The resulting values are called the least squares estimates of α and β.
R² value (in linear regression)
Var(predicted values of y) / Var(observed values of Y)
Or (r_x,y)² for one independent variable.
Comes from:
ŷᵢ = α + βxᵢ (the predicted value)
yᵢ - ŷᵢ = residual.
So (1/n)∑(yᵢ - ȳ)² = (1/n)∑(yᵢ - ŷᵢ)² + (1/n)∑(ŷᵢ - ȳ)²
Or, Var(y) = Mean(squared errors) + Var(predicted values)
and R² is the proportion of the variance of the observed values of Y explained by the model.
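A minimal sketch (not from the notes; numpy assumed, the data values are made up) of fitting the least-squares line and computing R² both ways:

```python
# Sketch: least-squares fit y = alpha + beta*x and R^2 = Var(predicted) / Var(observed).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

beta, alpha = np.polyfit(x, y, deg=1)     # slope and intercept minimising the squared residuals
y_hat = alpha + beta * x

r_squared = np.var(y_hat) / np.var(y)     # Var(predicted) / Var(observed)
r = np.corrcoef(x, y)[0, 1]               # sample correlation coefficient
print(alpha, beta, r_squared, r**2)       # r_squared matches r**2
```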
Population Random Sample
A set of independent, identically distributed (IID) random variables, drawn from a larger population with unknown ‘population’ distribution.
Point Estimate of Mean and Variance
A single value estimate of a population parameter, such as the mean or proportion, derived from a sample. The function of the sample used to compute it is called the 'point estimator'.
For mean and variance:
E(X̅) = μ
Var(X̅) = σ² / n
Where μ and σ² are the mean and variance of the entire population.
This means we expect the sample mean to get closer to the true value (because it is Sum/n) and its variance to get closer to 0, as n → ∞.
S² - Sample Variance for Point Estimations
The expectation of the 'regular' sample variance (dividing by n) is ((n - 1)/n)·σ², i.e. it is biased, so to fix it we use S², which is the same as the variance but divides by n - 1 instead of n.
S² = (1/(n - 1))∑Xᵢ² - (n/(n - 1))X̄², and E(S²) = σ².
Confidence Interval with normal-distribution
A random interval (X_L, X_U) that contains the parameter with probability/confidence level (1 - α). This is a 100(1 - α)% confidence interval. The normal-based interval below is used when the variance is known.
To calculate:
[X̄ - σz_(α/2)/√n, X̄ + σz_(α/2)/√n]
where:
X̄ is the sample mean
σ is the (known) population standard deviation
z_(α/2) is the value where P(Z > z_(α/2)) = α/2 (given in the question)
n is the number of samples
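A minimal sketch (not from the notes; numpy/scipy assumed, the data and σ are made-up values) of a 95% confidence interval when σ is known:

```python
# Sketch: 100(1 - alpha)% confidence interval for the mean, with sigma known.
import numpy as np
from scipy.stats import norm

x = np.array([4.9, 5.3, 5.1, 4.7, 5.0, 5.4, 4.8, 5.2])
sigma = 0.25                           # assumed known population standard deviation
alpha = 0.05
z = norm.ppf(1 - alpha / 2)            # z_{alpha/2}, about 1.96

half_width = sigma * z / np.sqrt(len(x))
print(x.mean() - half_width, x.mean() + half_width)
```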
t-distribution
A family of distributions generalising the Normal distribution, with a parameter called the degrees of freedom, ν. It has heavier tails than the Normal.
As ν → ∞, the t-distribution → the normal distribution.
Confidence Interval with t-distribution
Used when variance is unknown.
For a 100(1-α)% confidence interval:
To calculate:
[X̄ - s·t_(n-1,α/2)/√n, X̄ + s·t_(n-1,α/2)/√n]
X̄ is the sample mean
s is the sample standard deviation (unbiased)
t_(n-1,α/2) is the value where P(T > t_(n-1,α/2)) = α/2, with T having n - 1 degrees of freedom (given in the question)
n is the number of samples
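A minimal sketch (not from the notes; numpy/scipy assumed, the data are made up) of the same interval when σ is unknown, using the t-distribution:

```python
# Sketch: 100(1 - alpha)% confidence interval for the mean, with sigma unknown.
import numpy as np
from scipy.stats import t

x = np.array([4.9, 5.3, 5.1, 4.7, 5.0, 5.4, 4.8, 5.2])
n = len(x)
s = x.std(ddof=1)                          # unbiased sample standard deviation
alpha = 0.05
t_crit = t.ppf(1 - alpha / 2, df=n - 1)    # t_{n-1, alpha/2}

half_width = s * t_crit / np.sqrt(n)
print(x.mean() - half_width, x.mean() + half_width)
```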
Conditional Expectation E(Y | A)
E(Y | A) = ∑_y y · P(Y = y | A)
e.g. E(die roll | odd) = (1 + 3 + 5)/3 = 3
For continuous:
E(Y | A) = ∫_-∞^∞ y f_(Y|A)(y) dy
Conditional Expectation on another variable E(Y | X)
E(Y | X = x) = ∑_y y · P(Y = y | X = x) = ∑_y y · f_(Y|X)(y | x)
e.g. Y is the die roll result, X is 0 if the roll is even, 1 if it is odd
A is the event that the roll is odd
E(Y | A) = E(Y | X = 1) = 3 and E(Y | Aᶜ) = E(Y | X = 0) = 4
So you could also conclude E(Y | X) = 4 - X (a random variable that is a function of X).
Conditional Expectation collapse
E(E(Y|X)) = E(Y)
On the right is a sum or integral over the possible values that Y can take. On the left is the expectation of the random variable E(Y | X), which is a function of X, so the outer expectation is a sum or integral over all the possible values that X can take.
Random k-vector X
A column vector with k jointly-distributed random variables as components.
Mean of random k-vector
μ = (μ₁, …, μₖ)ᵀ where μᵢ = E(Xᵢ)
Covariance Matrix
A k×k matrix Σ whose diagonal entries are Var(Xᵢ) and whose off-diagonal (i, j) entries are Cov(Xᵢ, Xⱼ).
It is symmetric, since covariance is symmetric.
Affine Transformation
For vectors/matrices,
Y = AX + c
A is an m×k matrix
X is a k×1 random vector
c is a column m-vector
Y is a random vector result.
Mean and Covariance of Affine Transformation
If Y = AX + c (an affine transformation):
Mean vector is Aμ + c
Covariance matrix is AΣAᵀ
(where Σ is the covariance matrix of X)
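A minimal simulation sketch (not from the notes; numpy assumed, and μ, Σ, A and c are made-up values) checking these formulas:

```python
# Sketch: mean and covariance of Y = AX + c, compared with A mu + c and A Sigma A^T.
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 1.0],
              [0.0, 2.0]])
c = np.array([3.0, -1.0])

X = rng.multivariate_normal(mu, Sigma, size=200_000)   # rows are samples of X
Y = X @ A.T + c

print(Y.mean(axis=0), A @ mu + c)                      # close to A mu + c
print(np.cov(Y, rowvar=False), A @ Sigma @ A.T)        # close to A Sigma A^T
```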
Affine Transformation for Mean 0 and Variance 1
With random vector X, and assuming its covariance matrix is invertible:
P is the orthogonal matrix such that P⁻¹ΣP = D (D diagonal, from diagonalising Σ)
X″ = P⁻¹(X - μ)
Y = D^(-1/2) X″
i.e. Shift, Rotate, Rescale
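A minimal sketch (not from the notes; numpy assumed, μ and Σ are made-up values) of the shift / rotate / rescale steps, using an eigendecomposition of Σ for P and D:

```python
# Sketch: transform X so the result has mean 0 and identity covariance.
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=200_000)

d, P = np.linalg.eigh(Sigma)        # Sigma = P D P^T with P orthogonal, D = diag(d)
X_rot = (X - mu) @ P                # shift, then rotate (P^-1 = P^T)
Y = X_rot / np.sqrt(d)              # rescale each component

print(Y.mean(axis=0))               # close to (0, 0)
print(np.cov(Y, rowvar=False))      # close to the identity matrix
```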
Multivariate Normal Distribution Properties
If X has the multivariate normal distribution:
Y = AX + c also has a multivariate normal distribution (when AΣAᵀ is invertible)
For each i, the marginal distribution of Xᵢ is normal with mean μᵢ and variance Σᵢᵢ.
If Xᵢ and Xⱼ are uncorrelated, they are also independent.