Lectures 3-5
What is a machine learning pipeline?
A step-by-step process for solving problems with machine learning, from collecting data to deploying the final model.
What are the main stages of a machine learning pipeline?
Data collection • Data processing • Feature engineering • Model selection • Model training • Model evaluation • Model deployment • Maintenance
Name each stage and explain its purpose.
Data Collection: gathering and organizing the dataset that will be used to train and evaluate the machine learning model.
Data Processing: cleaning, transforming, and preparing raw data by handling missing values, encoding categorical variables, and scaling numerical features.
Feature Engineering: creating new features from existing ones to improve model performance by incorporating domain knowledge and identifying relevant patterns.
Model Selection: choosing an appropriate algorithm based on the problem type, data characteristics, and desired outcomes.
Model Training: feeding the training data to the selected algorithm so it can learn the patterns and relationships in the data.
Model Evaluation: assessing model performance using appropriate metrics and validation techniques to estimate how well it will generalize to new data.
Model Deployment: implementing the trained model in a production environment where it can make predictions on new data.
Maintenance: regularly monitoring model performance, retraining with new data, and updating the model when data patterns or requirements change.
What insights do descriptive statistics provide?
They show data averages, spread, outliers, and missing values to help understand what you're working with.
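A minimal sketch with pandas (the DataFrame and column names are made up for illustration):

```python
import pandas as pd

# Hypothetical dataset with a missing value in each column
df = pd.DataFrame({
    "age": [25, 32, None, 45, 29],
    "income": [40000, 52000, 61000, None, 48000],
})

# Count, mean, std, min/max, and quartiles per numeric column
print(df.describe())

# Number of missing values per column
print(df.isna().sum())
```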
What does a correlation matrix tell you?
It shows how strongly features are related to each other, helping identify which ones might be useful for prediction.
How do you interpret correlation values?
Near +1: strong positive relationship; near -1: strong negative relationship; near 0: little or no relationship.
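A quick sketch of computing a correlation matrix with pandas, on made-up columns:

```python
import pandas as pd

# Hypothetical data: study time, sleep, and exam scores
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "sleep_hours":   [8, 7, 7, 6, 5],
    "exam_score":    [52, 60, 68, 75, 83],
})

# Pearson correlation between every pair of numeric columns
print(df.corr())
```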
What is feature engineering?
Creating new features from existing ones to help the model make better predictions.
Why is feature engineering useful?
It can improve model performance by creating features that better capture important patterns in the data.
What should guide feature creation?
Domain knowledge, relationships with the target, and adding information the model can't discover on its own.
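For illustration, a small pandas sketch that derives a hypothetical ratio feature from two raw columns:

```python
import pandas as pd

# Hypothetical order data
df = pd.DataFrame({
    "total_price": [120.0, 80.0, 200.0],
    "num_items":   [4, 2, 5],
})

# New feature: average price per item, a relationship the raw
# columns only express indirectly
df["price_per_item"] = df["total_price"] / df["num_items"]
print(df)
```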
What are three ways to handle missing values?
Remove rows with missing data, remove entire columns, or fill in missing values (imputation).
What is imputation?
Filling in missing values with estimates like the average, middle value, or zero.
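A minimal sketch of dropping versus imputing with pandas, on a made-up column:

```python
import pandas as pd

df = pd.DataFrame({"age": [25.0, None, 41.0, 33.0, None]})

# Option 1: remove rows that contain missing values
dropped = df.dropna()

# Option 2: impute with the column mean (median or 0 are alternatives)
df["age"] = df["age"].fillna(df["age"].mean())
print(dropped, df, sep="\n")
```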
What's the difference between categorical and ordinal data?
Categorical data has no natural order (like colors), while ordinal data has a clear order (like ratings from poor to excellent).
What is one-hot encoding and when to use it?
It turns categories into separate yes/no columns. Use it for categorical data without order to prevent the model from seeing false relationships.
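One way to do this is pandas' get_dummies; the color column here is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# One indicator column per category, so no false ordering is implied
encoded = pd.get_dummies(df, columns=["color"])
print(encoded)
```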
Why scale features?
To put features on similar scales so that larger-valued features don't dominate smaller ones in the model.
What is normalization (min-max scaling)?
Rescaling values to fall between 0 and 1, making them easier to compare.
What is standardization?
Rescaling values to have a mean of 0 and a standard deviation of 1, useful when data is roughly bell-shaped.
When to use standardization instead of normalization?
When data has outliers or when using algorithms that assume data follows a bell curve.
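A small sketch contrasting the two with scikit-learn's scalers, on made-up values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0], [20.0]])

# Normalization: squeeze values into the [0, 1] range
print(MinMaxScaler().fit_transform(X).ravel())

# Standardization: shift to mean 0 and standard deviation 1
print(StandardScaler().fit_transform(X).ravel())
```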
Which algorithms need feature scaling most?
Algorithms that use distances or gradients, like k-NN, SVM, and neural networks.
What is binning and why use it?
Grouping similar values together to reduce noise and smooth out random fluctuations in the data.
What are two ways to create bins?
Equal-width bins (same value range per bin) or equal-frequency bins (same number of samples per bin).
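A sketch of both approaches using pandas' cut and qcut, on a hypothetical age column:

```python
import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45, 52, 61, 70])

# Equal-width bins: each bin spans the same value range
print(pd.cut(ages, bins=4))

# Equal-frequency bins: each bin holds the same number of samples
print(pd.qcut(ages, q=4))
```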
When might binning be harmful?
When it removes important variations in the data that could help with predictions.
Why split data into training and testing sets?
To see if the model can make good predictions on new data it hasn't seen before.
What problems can manual test selection cause?
Biased test sets that don't represent the real data, leading to misleading performance estimates.
What is stratified sampling?
Making sure important groups are represented in the same proportions in both training and test sets.
When to use stratified instead of random sampling?
When some groups are rare or when certain categories need equal representation in both sets.
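A minimal sketch using scikit-learn's train_test_split with its stratify option, on a made-up imbalanced label set:

```python
from sklearn.model_selection import train_test_split

X = [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # rare class: only two 1s

# stratify=y keeps the 80/20 class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=42
)
print(y_train, y_test)
```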
What is Mean Absolute Error (MAE)?
The average size of prediction errors, ignoring whether they're positive or negative.
What is Root Mean Square Error (RMSE)?
A measure that penalizes large errors more heavily: errors are squared, averaged, and then the square root is taken.
How do MAE and RMSE differ?
RMSE punishes large errors more, while MAE treats all error sizes more equally.
When to prefer RMSE over MAE?
When large errors are more problematic than small ones, and outliers are rare.
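A quick sketch computing both metrics with scikit-learn and NumPy, on made-up predictions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 12.0])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# The single large miss (9 vs 12) pushes RMSE well above MAE
print(f"MAE:  {mae:.3f}")
print(f"RMSE: {rmse:.3f}")
```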
What is overfitting?
When a model works well on training data but poorly on new data because it learned the noise, not just the pattern.
What is underfitting?
When a model is too simple to capture the important patterns in the data, performing poorly on both training and test data.
What is the bias-variance tradeoff?
Balancing between a model that's too simple (high bias) and one that's too complex (high variance).
What is k-fold cross-validation?
Splitting data into k parts, then training k different models, each using a different part as the test set and the rest for training.
Why use cross-validation?
To get a more reliable performance estimate by testing on all the data, and to see how consistent the model's performance is.
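A minimal sketch of 5-fold cross-validation with scikit-learn, on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=0)

# Each of the 5 folds serves once as the held-out test set
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(scores)                         # per-fold R^2 scores
print(scores.mean(), scores.std())    # average performance and consistency
```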
What's good about Linear Regression?
It's simple to understand and explain, showing clear relationships between inputs and outputs.
What's a problem with Decision Trees?
They tend to memorize the training data too well, leading to poor performance on new data.
Why is Random Forest often better than a single Decision Tree?
It combines many trees trained on different subsets of data, averaging out their mistakes.
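A sketch comparing the two with scikit-learn on synthetic regression data (the hyperparameters are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=10, noise=15, random_state=0)

# A single tree tends to overfit; averaging many trees smooths that out,
# so the forest usually scores higher in cross-validation
for model in (DecisionTreeRegressor(random_state=0),
              RandomForestRegressor(n_estimators=100, random_state=0)):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, round(scores.mean(), 3))
```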
What is regularization?
Adding constraints to prevent the model from becoming too complex and overfitting.
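One common form is Ridge regression, which adds an L2 penalty on the weights; a sketch on synthetic data where only one feature matters:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))
y = X[:, 0] * 2.0 + rng.normal(scale=0.5, size=30)  # only feature 0 is real

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # alpha sets the penalty strength

# The penalty pulls the weights on the nine noise features toward zero
print(np.abs(plain.coef_).round(2))
print(np.abs(ridge.coef_).round(2))
```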
What is model drift?
When a model's performance gets worse over time because the patterns in the data have changed.
Why maintain models after deployment?
To make sure they stay accurate as data patterns change over time.
When should a model be retrained?
When its performance drops, when the data changes significantly, or when the problem definition changes.