Tokenization
Mapping of words to numbers
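A minimal sketch of the idea in Python, assuming a toy whitespace tokenizer and a hypothetical two-sentence corpus:

```python
# Toy word-level tokenizer: build a vocabulary, then map words to integer ids.
corpus = ["the cat sat", "the dog sat"]
vocab = {word: i for i, word in enumerate(sorted({w for s in corpus for w in s.split()}))}

def tokenize(sentence):
    return [vocab[w] for w in sentence.split()]

print(vocab)                    # {'cat': 0, 'dog': 1, 'sat': 2, 'the': 3}
print(tokenize("the cat sat"))  # [3, 0, 2]
```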
Span
The set of all linear combinations of a matrix's column vectors (e.g., a plane or larger subspace)
Orthogonal
Two vectors x1 and x2 are orthogonal when x1^T x2 = 0 (their dot product is zero)
Full rank
A matrix whose columns span the entire space; a full-rank design matrix always makes a perfect (in-sample) prediction possible
Perfect prediction
To predict a vector of n elements exactly, you need to define enough variables to produce a matrix of rank n
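The four cards above can be illustrated with a small NumPy sketch (the matrix, target, and vectors are hypothetical):

```python
import numpy as np

# A 3x3 full-rank matrix spans R^3, so any target y can be predicted exactly.
X = np.array([[1., 0., 0.],
              [1., 1., 0.],
              [1., 1., 1.]])
y = np.array([2., 3., 5.])

print(np.linalg.matrix_rank(X))   # 3: full rank
beta = np.linalg.solve(X, y)      # exact coefficients
print(np.allclose(X @ beta, y))   # True: perfect prediction

# Orthogonality check: x1^T x2 = 0
x1, x2 = np.array([1., 0.]), np.array([0., 1.])
print(float(x1 @ x2))             # 0.0
```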
Set
Any collection, group, or conglomerate with no repeated elements
Elements
Members of a set
Disjoint set
sets with no elements in common
Sample space
Set comprising all possible outcomes associated with an experiment
Events
Subset of the sample space
Sigma algebra
A collection of subsets of the sample space on which probabilities are defined; for a discrete sample space, the collection of all subsets (the power set) is a valid sigma algebra
Probability function
Maps each event in the sigma algebra of a sample space to a real number in [0, 1]
Axioms of probability
For any event A, Pr(A) >= 0; for the whole sample space, Pr(sample space) = 1; and for a countable collection of pairwise disjoint events in the σ-algebra, the probability of their union equals the sum of their individual probabilities.
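Stated compactly in standard notation (same content as the card above):

```latex
\Pr(A) \ge 0 \ \text{ for every event } A \in \mathcal{F}, \qquad
\Pr(\Omega) = 1, \qquad
\Pr\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} \Pr(A_i)
\ \text{ for pairwise disjoint } A_i \in \mathcal{F}.
```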
Sigma algebra -> probability function
Needs to satisfy the axioms of probability
Random variable
A function assigning a real number to each outcome in the sample space, so probabilities carry over to its values
Probability mass function (pmf)
Gives the probability that a discrete random variable equals each specific value
Probability density function (pdf)
Does not represent the probability of a specific value; rather, integrating it over an interval gives the probability that X falls in that interval
Probability of a value on the pdf
Zero; under a pdf only intervals carry positive probability
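A short sketch using scipy.stats to make this concrete; the standard normal here is just an illustrative choice:

```python
from scipy.stats import norm

# For a continuous variable, pdf(x) is a density, not a probability.
X = norm(loc=0, scale=1)
print(X.pdf(0.0))                 # ~0.3989: density at a point, not Pr(X = 0)
# Pr(X = 0) is 0; only intervals carry probability, computed via the cdf:
print(X.cdf(1.0) - X.cdf(-1.0))   # ~0.6827: Pr(-1 < X < 1)
```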
Marginal distribution
Probability distribution of a single variable without regard for other variables
Functionals
Maps a function to a scalar
Expectation
The probability-weighted average of a random variable; the center of its distribution
Covariance
A measure of linear association between two variables (if they change together). Positive values indicate a positive relationship; negative values indicate a negative relationship
Parameter
Constant (theta) which indexes a probability model belonging to a family of models
Parameterized probability model
Probability model for the random variable we are interested in
Bernoulli
family of probability models indexed by theta, where theta can be between 0 and 1
Maximum likelihood estimation
function that takes in a sample and outputs the parameter value that maximizes the likelihood; this is our estimate of the true model parameter
Likelihood
function with the form of a probability function, consider to be a function of the parameter for a fixed sample
Form of likelihood
sampling distribution (probability distribution) of the iid sample
The inputs are parameter values; the observed sample is held fixed, playing the role the parameters play in a probability function
The likelihood does not operate as a probability function over the parameter; it need not sum or integrate to 1
For continuous cases, the likelihood is the density evaluated at the observed points
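A minimal sketch of maximum likelihood for a Bernoulli model, using a hypothetical sample and a grid search in place of the closed-form solution:

```python
import numpy as np

# Log-likelihood of an iid Bernoulli(theta) sample; the MLE maximizes it.
sample = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # hypothetical observed data

def log_likelihood(theta, x):
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

thetas = np.linspace(0.01, 0.99, 99)
mle = thetas[np.argmax([log_likelihood(t, sample) for t in thetas])]
print(mle, sample.mean())   # grid MLE ~ 0.75, matching the closed form x-bar
```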
iid
Independent and identically distributed random variables.
Statistic
A function on a sample
P-value
The probability of obtaining a value of the statistic at least as extreme as the one observed, conditional on the null hypothesis being true
Hypothesis
Assumption about a parameter
Two tailed
Used when we are unsure about which direction the extreme is in
One tailed
Used when the hypothesis can only be wrong in one direction, so only one tail counts as extreme
Alpha
Threshold value: for p-values below it, H0 is rejected; above it, H0 cannot be rejected
H0 is true, cannot reject H0
1-alpha, correct
H0 is true, reject H0
Alpha, type I error
H0 is false, cannot reject H0
Beta, type II error
H0 is false, reject H0
1-beta, power
Decrease alpha
Higher beta, type II error
Increasing alpha results in
Higher type I error, lower type II error
Power increases with
Increased sample size, increased alpha
Power definition
The probability of correctly rejecting H0 when it is false; it grows with how far the true parameter value is from the H0 value
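A simulation sketch of these power facts, using a one-sample t-test on hypothetical normal data (the true mean, sample sizes, and alpha values are illustrative):

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)

def power(n, alpha, true_mean=0.5, sims=2000):
    # Fraction of simulated samples where we correctly reject H0: mean = 0.
    rejections = 0
    for _ in range(sims):
        x = rng.normal(true_mean, 1.0, size=n)
        if ttest_1samp(x, 0.0).pvalue < alpha:
            rejections += 1
    return rejections / sims

print(power(n=20, alpha=0.05))   # lower power
print(power(n=80, alpha=0.05))   # higher with larger sample size
print(power(n=20, alpha=0.20))   # higher with larger alpha
```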
Alternative hypothesis
Set of values where we suspect our true parameter value will fall if our H0 is incorrect
Generalized linear models
The distribution of Y|X is in the exponential family; a monotonic link function (with an inverse) connects the expected value of Y given X to the linear predictor X*beta; and Y depends on X only through X*beta
Link function
Function that acts on the expected value of Y given X; must be monotonic
Monotonic
Strictly increasing or decreasing, so an inverse function can be defined
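A minimal NumPy sketch of the logit link used in logistic regression (one GLM in the exponential family); the design matrix and coefficients are hypothetical:

```python
import numpy as np

# Logistic regression as a GLM: the logit link maps E[Y|X] = mu in (0, 1)
# to the linear predictor X @ beta on the whole real line.
def logit(mu):            # link function (monotonic, hence invertible)
    return np.log(mu / (1 - mu))

def inv_logit(eta):       # its inverse maps X @ beta back to a mean
    return 1 / (1 + np.exp(-eta))

X = np.array([[1., -2.], [1., 0.], [1., 2.]])   # hypothetical design matrix
beta = np.array([0.5, 1.0])                      # hypothetical coefficients
mu = inv_logit(X @ beta)
print(mu)                                # conditional means, each in (0, 1)
print(np.allclose(logit(mu), X @ beta))  # True: the link inverts back
```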
Support vector machines
Find a linear function that provides maximum separation between the two groups of Y, i.e., the maximum margin
Gradient descent algorithm
Moves the parameter value opposite the gradient of the function, weighted by a step size (learning rate) at each iteration
Gradient
Direction of steepest ascent of the function; stepping against it gives steepest descent
Epochs
Number of complete passes the algorithm makes through the training data
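A minimal gradient descent sketch on a one-dimensional quadratic; the learning rate and iteration count are illustrative choices:

```python
# Gradient descent on f(w) = (w - 3)^2: step opposite the gradient,
# scaled by a learning rate, for a fixed number of iterations.
def grad(w):
    return 2 * (w - 3)

w, lr = 0.0, 0.1
for step in range(100):
    w -= lr * grad(w)
print(w)   # ~3.0, the minimizer
```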
Conjugate prior
a probability distribution that, when combined with a specific likelihood function using Bayes' theorem, results in a posterior distribution belonging to the same parametric family as the prior
Informative prior
strong predefined knowledge about a parameter before observing any data.
Improper Prior
a prior that does not integrate to 1, e.g., a "uniform" probability over an infinite interval, which cannot be properly defined
Posterior mean
Expected value of a parameter derived from its posterior distribution
marginal distribution of posterior
the probability distribution of a subset of parameters in a Bayesian model, calculated by integrating the full joint posterior distribution over all other "nuisance" parameters
Stochastic process
Collection of random vectors with defined conditional relationships, often indexed by an ordered set t
Loss function
Puts a number on each possible outcome to quantify how bad the outcome would be given that it is not the true value
Risk function
Expected value of the loss function (integrate over the probability of each of the possible values the decision function could map to)
Maximum a posteriori (MAP) estimation
Selects the value of the parameter that maximizes the posterior; a point estimate that is not fully Bayesian, since it ignores the rest of the posterior distribution
Bayesian estimator
Selects the posterior mean (the Bayes estimator under squared-error loss)
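A short sketch tying these cards together with the Beta-Bernoulli conjugate pair; the prior pseudo-counts and data are hypothetical:

```python
# Beta-Bernoulli conjugacy: a Beta(a, b) prior and a Bernoulli likelihood
# yield a Beta posterior, so the update is just adding counts.
a, b = 2, 2                      # hypothetical prior pseudo-counts
successes, failures = 7, 3       # hypothetical observed data

a_post, b_post = a + successes, b + failures          # posterior is Beta(9, 5)
posterior_mean = a_post / (a_post + b_post)           # Bayesian estimator
map_estimate = (a_post - 1) / (a_post + b_post - 2)   # posterior mode (MAP)
print(posterior_mean, map_estimate)                   # ~0.643, ~0.667
```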
Bayes classifier
predicts class labels by calculating the posterior probability of each class given the features and choosing the class with the highest posterior probability
Bayesian network
Assumes that among a set of variables, some of the variables are made conditionally independent by conditioning on other variables while others cannot be made independent no matter what variables they are conditioned on
Unidentified
Multiple models (or parameter values) fit the data equally well, so the true one cannot be distinguished
penalized regression
form of regularization that adds a penalty term to the regression loss, aimed at preventing overfitting
Unlearnable system
a system is generally unlearnable if Y and X are independent
No free lunch theorem
There is no single mapping (learning algorithm) that is optimal for every case
Bias-variance trade off
balancing error from overly simplistic assumptions (bias) and high sensitivity to training data (variance)
Acceptable error
Given the acceptable error for a task, a system may not admit a predictive model that is useful in practice
Bias
calculated as the squared difference between the expected prediction and true value
Variance
calculated as the expected squared deviation of a model's prediction from its own average
Error
variation in the data that is expected to be conditionally independent of the model
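These three cards combine in the standard decomposition of expected squared prediction error, where f is the true function, f-hat the fitted model, and sigma^2 the irreducible error:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```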
Kernel function
Symmetric function that takes in a pair of observation vectors and outputs a number
Ensemble method
Run the base algorithm multiple times with shallow trees, select many trees that perform well, and combine them for the final decision task
Bagging
Bootstrap the sample, use the algorithm to fit a new tree on each bootstrap sample, and combine the trees to create the learner
Bootstrapping
Produce a new sample by sampling with replacement
Boosting
Fit an original tree, calculate residual error, and fit a new tree to the error and repeat
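A minimal bootstrap sketch in NumPy (the data values are hypothetical); bagging would fit one learner per such resample and average their outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # hypothetical sample

# Bootstrapping: draw new samples of the same size, with replacement.
boot_means = [rng.choice(data, size=data.size, replace=True).mean()
              for _ in range(1000)]
print(np.mean(boot_means), np.std(boot_means))   # estimate and its spread
```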
Neural network
Non-linear method that defines large numbers of non-linear transformations of the inputs, with some ability to select among them given data, while combining information across many observed variables
In NN observations are
rows
In NN variables are
columns
First step of NN
Linear transformation of variables
Second step of NN
Application of activation function
Third step of NN
Linear transformation of outputs
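The three steps correspond line for line to a minimal NumPy forward pass (the weights and layer sizes are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))      # 4 observations (rows) x 3 variables (columns)
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)

h = X @ W1 + b1                  # step 1: linear transformation of variables
h = np.maximum(h, 0.0)           # step 2: activation function (ReLU here)
y_hat = h @ W2 + b2              # step 3: linear transformation of outputs
print(y_hat.shape)               # (4, 1): one prediction per observation
```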
Universal approximation theorem
A two-layer neural network has the capacity to approximate any continuous (or discrete) function from an input set to arbitrary accuracy, as long as the activation functions are non-linear
Objective function
Function whose minimum (or maximum) indicates how well the model fits the data
Back propagation algorithm
Computes the gradient of the loss with respect to the weights via the chain rule, supplying the derivatives that gradient descent needs
Acceptable error
Depends on the task; there is an advantage when many answers are acceptable
Forward propagation
Calculates the predicted y from the inputs so the loss can be computed against the actual y
Tensors
Mathematical object that encodes multilinear relationships between sets of vectors; used for transformations (in practice, a multidimensional array)
Arithmetic logic unit
Component that performs arithmetic and logical operations
Control unit
Directs operations
Cache
Where data is stored temporarily
Why NNs use GPUs
Parallel processing: the NN's array structure lets its computations be distributed across many processing cores
SoftMax
Generates probabilities from untransformed data, emphasizes high values and normalizes so that the sum of the vector is 1
Logits
Raw, unnormalized, real-valued scores produced by the final layer of a neural network
Cross entropy
loss that measures the difference between two probability distributions, typically the true labels and the predicted probabilities
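A minimal sketch of softmax and cross entropy together, with a numerically stable softmax; the logits and one-hot label are hypothetical:

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; exponentiate and normalize to 1.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # hypothetical raw network outputs
probs = softmax(logits)
print(probs, probs.sum())            # emphasizes high values; sums to 1

true = np.array([1.0, 0.0, 0.0])     # one-hot true label
cross_entropy = -np.sum(true * np.log(probs))
print(cross_entropy)                 # lower when probs match the label
```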
ADAM
Algorithm that adapts the step size: takes bigger steps when the slope is larger and smaller steps when closer to the minimum
Feedforward
NN in which each layer takes its input from the previous layer, so information flows only forward
Assumptions for NN construction
Irreducible error is low and the acceptable error is high
Training data is representative of the entire possible dataset that could be observed
There is a lower dimensional set that can be learned/inferred from measured variables
There is a prior understanding of how measured variables are related to features of value for modeling outcomes
Deep learning
More than one hidden layer in a NN