Machine Learning Pipeline

Description and Tags

lecture 3-5


41 Terms

1

What is a machine learning pipeline?

A step-by-step process for solving problems with machine learning, from collecting data to deploying the final model.

2

What are the main stages of a machine learning pipeline?

Data collection • Data processing • Feature engineering • Model selection • Model training • Model evaluation • Model deployment • Maintenance

3

Name each pipeline stage and its purpose.

Data Collection: The process of gathering and organizing the dataset that will be used to train and evaluate your machine learning model.

Data Processing: Cleaning, transforming, and preparing raw data by handling missing values, encoding categorical variables, and scaling numerical features.

Feature Engineering: Creating new features from existing ones to improve model performance by incorporating domain knowledge and identifying relevant patterns.

Model Selection: Choosing an appropriate algorithm based on the problem type, data characteristics, and desired outcomes.

Model Training: Feeding the training data to the selected algorithm to learn patterns and relationships in the data.

Model Evaluation: Assessing model performance using appropriate metrics and validation techniques to determine how well it will generalize to new data.

Model Deployment: Implementing the trained model in a production environment where it can make predictions on new data.

Maintenance: Regularly monitoring model performance, retraining with new data, and updating it as needed when data patterns or requirements change.

4

What insights do descriptive statistics provide?

They show data averages, spread, outliers, and missing values to help understand what you're working with.
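A minimal sketch of collecting these descriptive statistics with Python's standard statistics module; the ages list is hypothetical toy data, with None standing in for a missing value:

```python
import statistics

# Hypothetical feature column; None marks a missing value.
ages = [23, 25, 24, 61, None, 22]

present = [a for a in ages if a is not None]
summary = {
    "count": len(present),
    "missing": len(ages) - len(present),     # how much data is absent
    "mean": statistics.mean(present),        # the average
    "stdev": statistics.pstdev(present),     # the spread
    "min": min(present),
    "max": max(present),                     # 61 stands out as a possible outlier
}
```

Scanning a summary like this per column is often the first step after loading a dataset.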

5

What does a correlation matrix tell you?

It shows how strongly features are related to each other, helping identify which ones might be useful for prediction.

6

How do you interpret correlation values?

Near +1: strong positive relationship, near -1: strong negative relationship, near 0: little or no relationship.
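The value behind each cell of a correlation matrix is the Pearson correlation coefficient; a minimal pure-Python sketch of it, with toy inputs illustrating the three cases above:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# pearson([1, 2, 3], [2, 4, 6]) -> +1 (y rises with x)
# pearson([1, 2, 3], [6, 4, 2]) -> -1 (y falls as x rises)
# pearson([1, 2, 3], [5, 1, 5]) ->  0 (no linear relationship)
```

A correlation matrix simply evaluates this for every pair of feature columns.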

7

What is feature engineering?

Creating new features from existing ones to help the model make better predictions.

8

Why is feature engineering useful?

It can improve model performance by creating features that better capture important patterns in the data.

9

What should guide feature creation?

Domain knowledge, relationships with the target, and adding information the model can't discover on its own.
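A minimal sketch of feature engineering on hypothetical housing rows: a ratio of two raw counts can capture a pattern (crowding) that neither count expresses on its own:

```python
# Hypothetical housing data: raw counts per district.
rows = [
    {"total_rooms": 880, "households": 126},
    {"total_rooms": 7099, "households": 1138},
]

# Engineered feature: rooms per household. Domain knowledge says crowding
# matters for prices, but the model can't divide columns on its own.
for row in rows:
    row["rooms_per_household"] = row["total_rooms"] / row["households"]
```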

10

What are three ways to handle missing values?

Remove rows with missing data, remove entire columns, or fill in missing values (imputation).

11

What is imputation?

Filling in missing values with estimates such as the mean (average), median (middle value), or zero.
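A minimal sketch of mean imputation on a toy list, with None marking the missing entries:

```python
import statistics

values = [4.0, None, 6.0, None, 5.0]   # toy column with two missing values

present = [v for v in values if v is not None]
mean = statistics.mean(present)        # 5.0
median = statistics.median(present)    # 5.0 (more robust to outliers)

# Impute: replace each missing value with the mean of the observed ones.
mean_imputed = [v if v is not None else mean for v in values]
```

Median imputation works the same way and is usually preferred when the column has outliers.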

12

What's the difference between categorical and ordinal data?

Categorical data has no natural order (like colors), while ordinal data has a clear order (like ratings from poor to excellent).

13

What is one-hot encoding, and when should you use it?

It turns categories into separate yes/no columns. Use it for categorical data without order to prevent the model from seeing false relationships.
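A minimal pure-Python sketch of one-hot encoding a toy color column:

```python
colors = ["red", "green", "blue", "green"]
categories = sorted(set(colors))        # fixed column order: blue, green, red

def one_hot(value, categories):
    """One 1/0 (yes/no) column per category."""
    return [1 if value == c else 0 for c in categories]

encoded = [one_hot(c, categories) for c in colors]
# "green" -> [0, 1, 0]: unlike labeling red=0, green=1, blue=2,
# no false ordering between the colors is implied.
```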

14

Why scale features?

To put features on similar scales so that larger-valued features don't dominate smaller ones in the model.

15

What is normalization (min-max scaling)?

Rescaling values to fall between 0 and 1, making them easier to compare.

16

What is standardization?

Rescaling values to have a mean of 0 and a standard deviation of 1, useful when data has a bell-curve shape.
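A minimal sketch of both scalings on one toy column:

```python
import statistics

xs = [10.0, 20.0, 30.0, 40.0]   # toy feature column

# Normalization (min-max): rescale into [0, 1].
lo, hi = min(xs), max(xs)
normalized = [(x - lo) / (hi - lo) for x in xs]

# Standardization (z-score): mean 0, standard deviation 1.
mean, std = statistics.mean(xs), statistics.pstdev(xs)
standardized = [(x - mean) / std for x in xs]
```

Note that min-max scaling is driven entirely by the two extreme values, which is why outliers distort it more than standardization.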

17

When should you use standardization instead of normalization?

When data has outliers or when using algorithms that assume data follows a bell curve.

18

Which algorithms need feature scaling most?

Algorithms that use distances or gradients, like k-NN, SVM, and neural networks.

19

What is binning and why use it?

Grouping similar values together to reduce noise and smooth out random fluctuations in the data.

20

What are two ways to create bins?

Equal-width bins (same value range per bin) or equal-frequency bins (same number of samples per bin).
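Both binning strategies can be sketched in a few lines of pure Python on toy values (real pipelines typically use library helpers for this):

```python
values = sorted([3, 7, 12, 18, 25, 31, 44, 50])   # toy data, already sorted

# Equal-width: each bin spans the same value range.
n_bins = 2
lo, hi = min(values), max(values)
width = (hi - lo) / n_bins
equal_width = [min(int((v - lo) / width), n_bins - 1) for v in values]

# Equal-frequency: each bin holds the same number of samples.
per_bin = len(values) // n_bins
equal_freq = [i // per_bin for i in range(len(values))]
```

On skewed data the two disagree: equal-width bins can end up nearly empty, while equal-frequency bins stay balanced.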

21

When might binning be harmful?

When it removes important variations in the data that could help with predictions.

22

Why split data into training and testing sets?

To see if the model can make good predictions on new data it hasn't seen before.

23

What problems can manual test selection cause?

Biased test sets that don't represent the real data, leading to misleading performance estimates.

24

What is stratified sampling?

Making sure important groups are represented in the same proportions in both training and test sets.

25

When should you use stratified instead of random sampling?

When some groups are rare or when certain categories need equal representation in both sets.
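A minimal pure-Python sketch of a stratified split, splitting within each label group so proportions carry over to both sets; the spam/ham labels are hypothetical:

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.25, seed=0):
    """Return (train_idx, test_idx) keeping each label's share the same in both."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    train_idx, test_idx = [], []
    for idxs in by_label.values():          # split each group separately
        rng.shuffle(idxs)
        n_test = int(len(idxs) * test_frac)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx

labels = ["spam"] * 8 + ["ham"] * 4         # hypothetical imbalanced labels
train_idx, test_idx = stratified_split(labels)
# Both sets keep the 2:1 spam-to-ham ratio of the full data.
```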

26

What is Mean Absolute Error (MAE)?

The average size of prediction errors, ignoring whether they're positive or negative.

27

What is Root Mean Square Error (RMSE)?

A measure that penalizes large errors more heavily by squaring them before averaging.

28

How do MAE and RMSE differ?

RMSE punishes large errors more, while MAE treats all error sizes more equally.
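Both metrics in a minimal sketch, with toy predictions chosen so one large error shows the difference:

```python
import math

def mae(actual, predicted):
    """Mean absolute error: average error size, sign ignored."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error: squaring penalizes large errors more."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [3.0, 5.0, 7.0, 9.0]
preds = [3.0, 5.0, 7.0, 13.0]   # three perfect predictions, one error of 4
# mae  -> 1.0 (the error averaged over all four points)
# rmse -> 2.0 (the single large error dominates after squaring)
```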

29

When should you prefer RMSE over MAE?

When large errors are more problematic than small ones, and outliers are rare.

30

What is overfitting?

When a model works well on training data but poorly on new data because it learned the noise, not just the pattern.

31

What is underfitting?

When a model is too simple to capture the important patterns in the data, performing poorly on both training and test data.

32

What is the bias-variance tradeoff?

Balancing between a model that's too simple (high bias) and one that's too complex (high variance).

33

What is k-fold cross-validation?

Splitting data into k parts, then training k different models, each using a different part as the test set and the rest for training.

34

Why use cross-validation?

To get a more reliable performance estimate by testing on all the data, and to see how consistent the model's performance is.
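A minimal sketch of generating the k folds by index (shuffling before splitting is omitted here for clarity, but is usual in practice):

```python
def k_fold_indices(n_samples, k):
    """Split indices into k folds; each fold serves once as the test set."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    folds = []
    for i in range(k):
        start = i * fold_size
        end = start + fold_size if i < k - 1 else n_samples
        test = indices[start:end]                  # this fold is held out
        train = indices[:start] + indices[end:]    # the rest trains the model
        folds.append((train, test))
    return folds

# k_fold_indices(6, 3) yields 3 (train, test) pairs; every sample
# appears in exactly one test fold, so all data gets tested once.
```

Averaging the k test scores gives the performance estimate; their variation shows how consistent the model is.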

35

What's good about Linear Regression?

It's simple to understand and explain, showing clear relationships between inputs and outputs.

36

What's a problem with Decision Trees?

They tend to memorize the training data too well, leading to poor performance on new data.

37

Why is Random Forest often better than a single Decision Tree?

It combines many trees trained on different subsets of data, averaging out their mistakes.

38

What is regularization?

Adding constraints to prevent the model from becoming too complex and overfitting.
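A minimal sketch of one concrete form, L2 (ridge) regularization, for a one-feature linear model without intercept; lam is the penalty strength (a hyperparameter you would tune):

```python
def ridge_slope(x, y, lam):
    """Minimize sum((y - w*x)^2) + lam * w^2.
    Setting the derivative to zero gives w = sum(xy) / (sum(x^2) + lam)."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]   # true slope is 2
# lam = 0 recovers the unregularized fit; larger lam shrinks
# the slope toward 0, constraining how complex the fit can get.
```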

39

What is model drift?

When a model's performance gets worse over time because the patterns in the data have changed.

40

Why maintain models after deployment?

To make sure they stay accurate as data patterns change over time.

41

When should a model be retrained?

When its performance drops, when the data changes significantly, or when the problem definition changes.