Independent Variables of Machine Learning
Inputs used to make predictions
Dependent Variables of Machine Learning
The target/results that are predicted
Simple Linear Regression
The type of regression that utilizes one variable to predict the target
Multiple Linear Regression
The type of regression that utilizes multiple variables to predict the target
What does "fitting a line" mean in linear regression?
Finding the line/hyperplane that minimizes prediction error between predicted and actual values
What type of variable does linear regression predict?
Continuous numerical variables
In linear regression why is a column of 1s added?
It adds the intercept (bias) term to the model
Why is the Normal Equation computationally expensive?
It requires matrix inversion which can be very slow for data sets with a large number of features
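A minimal sketch of the Normal Equation for a single feature, assuming a design matrix whose first column is all 1s so the intercept is learned as theta0. The function name `normal_equation_1d` and the sample data are illustrative only. With two columns the matrix to invert is just 2 x 2; with n features it grows to roughly n x n, which is where the cost comes from.

```python
def normal_equation_1d(xs, ys):
    """Solve theta = (X^T X)^-1 X^T y for X = [1, x] (intercept column plus one feature)."""
    n = len(xs)
    # Entries of X^T X, a 2x2 matrix because X has two columns: [1, x].
    s_x = sum(xs)
    s_xx = sum(x * x for x in xs)
    # Entries of X^T y.
    s_y = sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys))
    # Invert the 2x2 matrix [[n, s_x], [s_x, s_xx]] explicitly.
    det = n * s_xx - s_x * s_x
    theta0 = (s_xx * s_y - s_x * s_xy) / det
    theta1 = (n * s_xy - s_x * s_y) / det
    return theta0, theta1


# Illustrative data generated from y = 2x + 1.
theta0, theta1 = normal_equation_1d([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```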
When would you choose a gradient descent algorithm over a Normal Equation algorithm?
When datasets are very large or have a large number of features
What variable does logistic regression predict?
Binary class probabilities for classification
Purpose of the Sigmoid Function in Logistic Regression
Scales the output of the linear model into the probability range of [0,1]
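A small sketch of the sigmoid function, written in a numerically stable form so that very negative inputs do not overflow. The name `sigmoid` and the sample inputs are illustrative.

```python
import math


def sigmoid(z):
    """Map any real-valued linear output z to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, rewrite the formula to avoid overflow in exp(-z).
    e = math.exp(z)
    return e / (1.0 + e)


p_mid = sigmoid(0.0)    # a linear output of 0 maps to probability 0.5
p_high = sigmoid(10.0)  # large positive outputs approach 1
p_low = sigmoid(-10.0)  # large negative outputs approach 0
```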
Why is MSE not used in Logistic Regression?
It makes the cost function nonconvex and harder to optimize
What variable does softmax regression predict?
Multi class probabilities for classification
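A minimal sketch of the softmax function, which turns one score per class into probabilities that sum to 1. The name `softmax` and the scores are illustrative; subtracting the maximum score is a common stability trick and does not change the result.

```python
import math


def softmax(scores):
    """Convert a list of class scores into class probabilities."""
    m = max(scores)  # subtract the max so exp() never overflows
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # highest score gets the highest probability
```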
One-Hot Encoding
Representing categorical classes as binary vectors where only one value is 1 and the rest are 0s
Supervised Learning
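A short sketch of one-hot encoding in plain Python. The helper `one_hot` and the animal labels are illustrative; classes are sorted alphabetically here only so the column order is predictable.

```python
def one_hot(labels):
    """Return the class list and one binary vector per label."""
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    vectors = [
        [1 if index[label] == i else 0 for i in range(len(classes))]
        for label in labels
    ]
    return classes, vectors


classes, vectors = one_hot(["cat", "dog", "bird", "cat"])
```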
Model is trained with labeled data
Unsupervised Learning
Model is trained with unlabeled data
Semi-Supervised Learning
Model is trained with a small amount of labeled data combined with a larger amount of unlabeled data
Instance-Based Learning
Model that compares new examples to stored training instances. Saves training data for reference
Model-Based Learning
Model builds an equation/graph that is used to predict target. Does not save training data for reference
Underfitting
Model performs poorly on both training and test data because it is too simple for the dataset to learn the feature-target relationships
Overfitting
Model succeeds with training data but fails at testing data and is overly complicated for the given dataset
Which model failure is poor at generalization?
Overfitting
How to fix underfitting?
Use a more complex model
What are the 3 main examples of bad data problems?
- Dataset has a lot of noise
- Dataset has a lot of outliers
- Dataset has a lot of missing features/values
Class Imbalance
One class dominates or appears more in a data set
Why is a high accuracy misleading in a heavily imbalanced data set?
A model can predict only the majority class and still score high accuracy while getting every minority-class example wrong
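A small worked example of the effect, assuming a made-up dataset with 95 negatives and 5 positives and a model that always predicts the majority class. The variable names are illustrative.

```python
# Hypothetical labels: 95 examples of class 0, 5 examples of class 1.
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class.
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: how many true positives were found.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(t == 1 for t in y_true)
```

Accuracy comes out at 0.95 even though recall on the minority class is 0.0, which is why a metric such as recall is worth checking alongside accuracy.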
K-Fold Cross Validation
Data is split into k mutually exclusive subsets (folds) and k training/testing experiments are conducted, each using a different fold as the test set
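A minimal sketch of how the k folds can be generated as index lists. The generator name `k_fold_indices` is illustrative, and the sketch assumes no shuffling; in practice the data is usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))  # 3 experiments, each with a different test fold
```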
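Why can't training data be used for testing?
The model has already seen that data, so the test score would be overly optimistic and would not measure generalization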
Feature Engineering
Creating or transforming features to improve model performance
Examples of Feature Engineering
- Feature Creation
- Feature Selection
- Dimensionality Reduction
Feature Derivation
Creating new features from existing ones
When should a feature be removed?
- Feature has no correlation with target
- Feature is highly correlated/redundant to another feature
- All values in the feature are the same
- Over 60% of the values of the feature are missing
Best Plot for categorical variables
Bar chart (count plot); a histogram is the equivalent for continuous variables
Best plot to show correlation between continuous variables
Heatmap
Primary Question of EDA
Is my data ready for machine learning?
Consequences of skipping EDA
- Incorrect conclusions
- Overfitting
- Underfitting
- Issues with the data remain undiscovered
- Wasting time
What sampling method preserves class ratios in the train/test split?
Stratified Sampling
Stratified Sampling
A type of probability sampling in which the population is divided into groups with a common attribute and a random sample is chosen within each group
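A minimal sketch of a stratified train/test split, assuming the class label is the common attribute used to form the groups. The function name `stratified_split`, the `test_fraction` parameter, and the labels are illustrative.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, seed=0):
    """Split indices so each class keeps the same share in train and test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)  # random sample within each group
        n_test = round(len(idxs) * test_fraction)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return train, test


# Hypothetical imbalanced labels: 80% class "a", 20% class "b".
labels = ["a"] * 80 + ["b"] * 20
train_idx, test_idx = stratified_split(labels, test_fraction=0.25)
```

Both the training and the test indices keep the original 80/20 class ratio.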
Equal-Frequency Binning
Each bin contains the same number of observations
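A short sketch of equal-frequency binning by rank. The helper `equal_frequency_bins` is illustrative, and it assumes ties may be split across neighboring bins, which a production implementation might handle differently.

```python
def equal_frequency_bins(values, n_bins):
    """Assign each value a bin index so bins hold (nearly) equal counts."""
    # Positions of the values in ascending order.
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


# Skewed data: equal-width bins would put almost everything in the first bin.
bins = equal_frequency_bins([1, 2, 3, 4, 100, 200, 300, 1000], n_bins=4)
```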
Main use of Gradient Descent
Minimizing the cost function during training
What is the role of partial derivatives in gradient descent?
They determine the direction to update model parameters
What does learning rate control in gradient descent?
The size of parameter updates during optimization
What happens if learning rate is too small?
Training becomes slow and takes too long to reach the minimum
What happens if learning rate is too large?
Training overshoots the minimum or diverges (causes pattern of oscillation instead of convergence)
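A small experiment illustrating both cases, assuming the toy cost function J(theta) = theta^2 (minimum at 0, gradient 2 * theta). The function name `descend` and the specific learning rates are illustrative.

```python
def descend(learning_rate, steps=20, start=1.0):
    """Run gradient descent on J(theta) = theta**2, whose gradient is 2*theta."""
    theta = start
    for _ in range(steps):
        theta -= learning_rate * 2 * theta
    return theta


small = descend(0.001)  # barely moves away from the start after 20 steps
good = descend(0.1)     # ends close to the minimum at 0
large = descend(1.1)    # overshoots on every step and grows in magnitude
```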
Why is feature scaling important in gradient descent?
It speeds up convergence and prevents uneven updates
Types of Gradient Descent
- Batch GD
- Stochastic GD
- Mini Batch GD
Batch Gradient Descent
Uses entire dataset in each step
Main Con of Batch GD
Slow for large datasets
Stochastic Gradient Descent
Uses one training example at a time, so each training example produces one update step
Pros of Stochastic GD
- Fast
- Works well with very large datasets
Con of Stochastic GD
- Noisy updates
Mini-Batch Gradient Descent
Uses small subsets of data for each step
Pros of Mini-Batch GD
- Efficient & stable
- Works well for large datasets
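The three variants differ only in how many examples feed each update step. A minimal sketch, assuming a hypothetical helper `iterate_batches` that yields the example indices used for each step in one pass over the data:

```python
import random


def iterate_batches(n_samples, batch_size, seed=0):
    """Yield lists of example indices, one list per parameter update."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]


batch_gd = list(iterate_batches(10, batch_size=10))   # whole dataset: 1 step per pass
sgd = list(iterate_batches(10, batch_size=1))          # one example: 10 steps per pass
mini_batch = list(iterate_batches(10, batch_size=4))   # small subsets: 3 steps per pass
```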
Use Case of Polynomial Regression
When variable relationships are nonlinear
Bias
Error due to an overly simple model
Variance
Error due to model sensitivity to training data
What do bias and variance look like in the case of best generalization?
Balanced
Regularization
Techniques used to reduce overfitting by penalizing large model coefficients
What are the 3 regularized linear models?
- Ridge Regression
- Lasso Regression
- Elastic Net
Ridge Regression
Method of regularization by limiting the sum of the squares of the coefficients (aka L2 regularization). Shrinks coefficients but doesn't eliminate them
Lasso Regression
Uses L1 regularization, which can shrink some coefficients exactly to zero and so performs feature selection
Elastic Net
Uses a combination of L1 & L2 regularization which balances feature selection and coefficient shrinkage
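A simplified sketch of the three penalty terms that get added to the cost function. The function names, `alpha` (penalty strength), and `l1_ratio` (mix between L1 and L2) are assumptions for illustration; libraries often scale these terms slightly differently.

```python
def ridge_penalty(coefs, alpha):
    """L2 penalty: alpha times the sum of squared coefficients."""
    return alpha * sum(c * c for c in coefs)


def lasso_penalty(coefs, alpha):
    """L1 penalty: alpha times the sum of absolute coefficients."""
    return alpha * sum(abs(c) for c in coefs)


def elastic_net_penalty(coefs, alpha, l1_ratio):
    """Weighted mix of the L1 and L2 penalties."""
    return (l1_ratio * lasso_penalty(coefs, alpha)
            + (1 - l1_ratio) * ridge_penalty(coefs, alpha))


coefs = [3.0, -4.0]  # hypothetical model coefficients
l2 = ridge_penalty(coefs, alpha=0.1)
l1 = lasso_penalty(coefs, alpha=0.1)
mixed = elastic_net_penalty(coefs, alpha=0.1, l1_ratio=0.5)
```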
What does a learning curve show?
Model Performance throughout training
Machine Learning
The process of training an algorithm to learn patterns from data and make predictions automatically
Components of a machine learning problem
- Task (T)
- Experience (E)
- Performance Measure (P)
Task
What a model is trying to do
Experience
The data a model learns from
Performance Measure
How a model is evaluated
What data does classification predict?
Categories
ML Workflow Steps
- Understand problem
- Collect data
- Perform EDA
- Prep & clean data
- Convert data to design matrix
- Train model
- Evaluate training and repeat last step if necessary
- Deploy model
Exploratory Data Analysis
The process of summarizing, visualizing, & understanding a data set before modeling
Nominal Data
Numbered categories with no order
Ordinal Data
Numbered categories with ordered ranking
Prediction Error in Linear Regression
The difference between an actual data point and the predicted value on the regression line (the residual)
Gradient Descent Process
- Start with random parameter values
- Calculate prediction error
- Compute Gradient
- Update parameters
- Repeat until convergence/until gradient = 0 & min is reached
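The steps above can be sketched for simple linear regression with an MSE cost. This is a minimal illustration, not a tuned implementation: the function name, learning rate, iteration count, and sample data are assumptions, and a fixed number of iterations stands in for a convergence check.

```python
import random


def gradient_descent(xs, ys, learning_rate=0.05, n_iters=2000, seed=0):
    """Fit y = theta0 + theta1 * x by batch gradient descent on the MSE."""
    rng = random.Random(seed)
    # Step 1: start with random parameter values.
    theta0, theta1 = rng.uniform(-1, 1), rng.uniform(-1, 1)
    n = len(xs)
    for _ in range(n_iters):
        # Step 2: prediction error for every example.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Step 3: gradient of the MSE with respect to each parameter.
        grad0 = (2 / n) * sum(errors)
        grad1 = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        # Step 4: move the parameters against the gradient.
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return theta0, theta1


# Illustrative data generated from y = 2x + 1.
theta0, theta1 = gradient_descent([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```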
Feature Scaling
Adjusting feature values to a common scale
Binary Classification
Classification with 2 classes
Multi-Class Classification
Classification with 3+ classes
Multi-Label Classification
One instance can have multiple labels at once. Ex: a movie with several genres
If there is a lot of bias in a model, what could we expect the model to have?
Underfitting
If there is a lot of variance in a model, what could we expect the model to have?
Overfitting
Reinforcement Learning
The training of machine learning models to make a sequence of decisions
What is commonly mistaken as a part of feature engineering but is not?
Replacing missing data
Pro of K-Fold Cross Validation
More reliable performance estimate
Single K-Fold Cross Validation
Why does a rule-based spam filter become difficult to maintain?
It creates a long list of complex rules that needs constant updating
Bootstrapping
A resampling method related to k-fold cross validation that draws random samples (with replacement) over multiple training rounds and averages the results
Min-Max Normalization Process
- Find min & max values
- Get range: max - min
- Plug in values (denoted by x or xi) into the formula to scale all values
Min-Max Normalization Formula
(xi - min) / range
Range of Min-Max Normalization
[0,1]
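The min-max steps can be sketched as follows. The function name and the sample values are illustrative, and mapping a constant feature to 0.0 is an assumption made here to avoid dividing by a zero range.

```python
def min_max_normalize(values):
    """Scale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    value_range = hi - lo
    if value_range == 0:
        # Assumption: a constant feature is mapped to 0.0 for every value.
        return [0.0 for _ in values]
    return [(x - lo) / value_range for x in values]


scaled = min_max_normalize([10, 20, 30, 50])  # min = 10, max = 50, range = 40
```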
Analogy for Gradient Descent Optimization
A person walking downhill into a valley
How does Gradient Descent move in relation to the slope of the cost function?
It moves in the opposite direction of the slope (gradient) of the cost function
What is the Learning Rate in Gradient Descent?
The size of steps taken during each iteration
If the learning rate is too small in GD what happens to convergence?
It converges too slowly and takes a long time to reach the local min
If the learning rate is too large in GD what happens to convergence?
It can skip over the minimum and diverge, usually oscillating from one side of the convex function to the other
Why is the MSE cost function used with GD for linear regression?
It is convex, so it has a single global minimum for gradient descent to reach
When both θ0 and θ1 vary, the cost function is best visualized using what plot?
3D Surface Plot
Contour Plot
Shows constant cost levels in 2D
If Gradient Descent starts at different initial points on a non-convex cost surface, what can happen?
It may reach different local minima
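A small demonstration, assuming the toy non-convex function J(theta) = theta^4 - 2 * theta^2, which has two minima at theta = -1 and theta = 1. The function name `descend_nonconvex`, the learning rate, and the starting points are illustrative.

```python
def descend_nonconvex(start, learning_rate=0.05, steps=200):
    """Gradient descent on J(theta) = theta**4 - 2*theta**2."""
    theta = start
    for _ in range(steps):
        gradient = 4 * theta ** 3 - 4 * theta
        theta -= learning_rate * gradient
    return theta


right = descend_nonconvex(start=0.5)   # rolls into the minimum near theta = 1
left = descend_nonconvex(start=-0.5)   # rolls into the minimum near theta = -1
```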
What determines the direction of the parameter update in Gradient Descent?
Derivatives