
Machine Learning Pipeline and Tools

Python Libraries

Card 1

Q: What is NumPy?

A: The primary library for numerical computing in Python, especially for matrix operations. It supports the creation and manipulation of N-dimensional arrays and provides high-level mathematical functions.

Card 2

Q: What is Matplotlib?

A: A library used for data visualization in Python that offers numerous options for creating plots, charts, and other visualizations.

Card 3

Q: What is Pandas?

A: A powerful tool for data manipulation and analysis, especially with large datasets. It enables converting datasets into data frames for easy manipulation and can read data from various file formats.

Card 4

Q: What is Scikit-learn?

A: A crucial library for machine learning algorithms and tools in Python. It includes a wide range of ML algorithms, is regularly updated with methods from recent research, and provides tools for the various steps of the ML pipeline.

Card 5

Q: What functions in Pandas help with initial data exploration?

A:

  • head(): Returns the first rows of a DataFrame (five by default)

  • info(): Shows each column's name, non-null count, and data type

  • describe(): Generates summary statistics (count, mean, std, min, quartiles, max) for each numerical attribute
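The three functions above can be tried on a small hypothetical DataFrame (the column names and values here are made up purely for illustration):

```python
import pandas as pd

# A tiny hypothetical dataset for illustration.
df = pd.DataFrame({
    "rooms": [4, 3, 5, 2, 6, 3],
    "price": [250_000, 180_000, 320_000, 150_000, 400_000, 200_000],
})

print(df.head())      # first five rows
df.info()             # column names, non-null counts, dtypes
print(df.describe())  # count, mean, std, min, quartiles, max per column
```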

Machine Learning Basics

Card 6

Q: What is supervised learning?

A: A type of machine learning where the model is trained on labeled data, meaning the data includes both the input features and the desired output (target).

Card 7

Q: What is regression in machine learning?

A: A type of supervised learning that aims at predicting numerical/continuous values (as opposed to classification, which predicts categorical values).

Card 8

Q: What are the main steps in the Machine Learning Pipeline?

A:

  1. Data Collection

  2. Understanding the Data

  3. Data Cleaning

  4. Model Selection

  5. Model Training

  6. Model Deployment

  7. Maintenance

Data Understanding & Feature Engineering

Card 9

Q: What is feature engineering?

A: Creating new features from existing ones to improve model performance. Example: creating "rooms per household" by dividing total rooms by number of households.
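The card's example can be sketched in two pandas lines; the column names and numbers below are hypothetical:

```python
import pandas as pd

# Hypothetical district-level housing data.
df = pd.DataFrame({
    "total_rooms": [1200, 800, 2000],
    "households": [300, 160, 400],
})

# Engineered feature: average rooms per household in each district.
df["rooms_per_household"] = df["total_rooms"] / df["households"]
```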

Card 10

Q: What is a correlation coefficient and what is its range?

A: A measure that quantifies the relationship between variables. Standard (Pearson's) correlation ranges from -1 to +1, where:

  • Values near +1 indicate strong positive correlation

  • Values near -1 indicate strong negative correlation

  • Values near 0 indicate weak or no correlation
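Pearson correlations between every pair of numerical columns can be computed with pandas' DataFrame.corr(); a small sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],  # y = 2x: perfect positive correlation with x
    "z": [5, 3, 4, 1, 2],   # roughly decreasing as x grows
})

corr = df.corr()  # Pearson correlation matrix by default
```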

Card 11

Q: What is a scatter matrix and what does it show?

A: A visual representation of the correlation matrix. It shows scatter plots for pairs of variables, histograms for individual variables, and helps identify relationships and anomalies in the data.

Data Cleaning & Preparation

Card 12

Q: What are the main strategies for handling missing values?

A:

  • Removing records with missing values

  • Removing the entire column

  • Replacing missing values with a calculated value (zero, mean, median, etc.)

Card 13

Q: What is the difference between categorical and ordinal data?

A:

  • Categorical/Nominal data: Data with no inherent order (e.g., hair color)

  • Ordinal data: Data with a specific order (e.g., bad, average, good)

Card 14

Q: What is one-hot encoding and when is it used?

A: A technique for creating binary vectors (new columns) for each category in categorical data. It's used for nominal categories with no inherent order, to convert categorical variables into a format that works better with machine learning algorithms.
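One common way to apply this card is pandas' get_dummies (scikit-learn's OneHotEncoder is the pipeline-friendly alternative); the category values below are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"hair_color": ["black", "blond", "red", "black"]})

# One binary column per category value.
encoded = pd.get_dummies(df, columns=["hair_color"])
```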

Card 15

Q: What is feature scaling and why is it important?

A: Transforming numerical features to have a similar scale. It's important when input values have very different ranges because it makes it easier to compare features and can speed up the convergence of optimization algorithms.

Card 16

Q: What is Min-Max Scaling (Normalization) and what's its formula?

A: A scaling technique where values are shifted and scaled to range from 0 to 1. Formula: X′ = (X - X_min) / (X_max - X_min)

Card 17

Q: What is Standardization (Z-score) and what's its formula?

A: A scaling technique where values are centered around the mean with unit variance. Formula: X′ = (X − μ) / σ, where μ is the mean and σ the standard deviation. Unlike normalization, values are not bounded to a specific range, and the result is less affected by outliers.
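The two formulas from Cards 16 and 17 can be written directly in NumPy (the sample values are arbitrary):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

# Min-Max scaling: (X - X_min) / (X_max - X_min), result lies in [0, 1]
minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: (X - mean) / std, result has zero mean and unit variance
standardized = (x - x.mean()) / x.std()
```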

Model Training & Evaluation

Card 18

Q: Why do we split data into training and testing sets?

A: To evaluate how well our model generalizes to new, unseen data. The training set is used to train the model, while the testing set is used to evaluate its performance.

Card 19

Q: What is stratified sampling and when should it be used?

A: A technique for grouping data into subgroups (strata) and sampling from each stratum. It should be used when the dataset has imbalanced features that are important for prediction, to ensure representation across all important feature categories.

Card 20

Q: What is Mean Absolute Error (MAE) and what's its formula?

A: A metric measuring the average of the absolute differences between predictions and actual values. Formula: MAE(X, h) = (1/m) Σ |h(x_i) − y_i|. It corresponds to the Manhattan norm (L1 norm) and is less sensitive to outliers.

Card 21

Q: What is Root Mean Square Error (RMSE) and what's its formula?

A: A metric measuring the square root of the average of the squared differences between predictions and actual values. Formula: RMSE(X, h) = √[(1/m) Σ (h(x_i) − y_i)²]. It corresponds to the Euclidean norm (L2 norm); it is more sensitive to outliers but generally preferred when outliers are rare.
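Both metrics can be computed in a couple of NumPy lines (the predictions and targets below are made up):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = np.mean(np.abs(y_pred - y_true))           # L1-based average error
rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))  # L2-based, amplifies large errors
```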

Card 22

Q: Compare MAE and RMSE in terms of sensitivity to outliers.

A:

  • MAE (Mean Absolute Error) is less sensitive to outliers since it uses absolute differences

  • RMSE (Root Mean Square Error) is more sensitive to outliers because squaring the differences amplifies larger errors

  • RMSE is generally preferred when outliers are rare, while MAE might be better when the data contains significant outliers

Model Deployment & Maintenance

Card 23

Q: What is model deployment?

A: The process of putting a trained model into production to make predictions on new data. It can involve retraining the model with new data, and the model selection can depend on the deployment approach.

Card 24

Q: What is model maintenance?

A: Ongoing work to keep the model performing well over time, which includes addressing data set changes and model drift, and retraining models to maintain accuracy.

Card 25

Q: What is model drift and why is it important?

A: The degradation of model performance due to changes in the data distribution over time. It's important because it affects the model's ability to make accurate predictions, requiring monitoring and periodic retraining.

Programming Tools

Card 26

Q: What is Jupyter Notebook?

A: An environment for writing Python code and experimenting with data analysis and machine learning that allows for combining code, annotations, and outputs in a single document. It operates through a web browser and supports multiple kernels.

Card 27

Q: What is SimpleImputer in scikit-learn used for?

A: A tool provided by scikit-learn to replace missing values in a dataset using strategies like mean, median, most frequent value, or constant.
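A minimal sketch of SimpleImputer replacing a missing value with its column's median (the array values are arbitrary):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [7.0, 6.0]])

imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X)  # NaN replaced by the median of its column
```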

Card 28

Q: What scikit-learn function is used for splitting data into training and testing sets?

A: The train_test_split function, typically used with a test_size parameter (e.g., 0.2 for an 80%/20% split).
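A quick sketch of that split on toy arrays (random_state is fixed only for reproducibility):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # 80% train, 20% test
)
```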

Card 29

Q: What scikit-learn class is used for stratified sampling?

A: StratifiedShuffleSplit class, which preserves the percentage of samples for each class in both training and test sets.
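A sketch with a deliberately imbalanced toy label, so the stratification is visible: each half of the split keeps the original 80/20 class proportions.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)  # imbalanced: 80% class 0, 20% class 1

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
for train_idx, test_idx in splitter.split(X, y):
    y_train, y_test = y[train_idx], y[test_idx]
```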

Card 30

Q: What scikit-learn classes are used for feature scaling?

A:

  • MinMaxScaler: For normalization (scaling to a 0-1 range)

  • StandardScaler: For standardization (centering around mean with unit variance)
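Both classes follow scikit-learn's fit/transform pattern; a minimal sketch with arbitrary values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])

X_norm = MinMaxScaler().fit_transform(X)   # scaled into [0, 1]
X_std = StandardScaler().fit_transform(X)  # zero mean, unit variance
```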

Advanced Concepts

Card 31

Q: What is the trade-off between bias and variance in model selection?

A: A fundamental challenge in ML where:

  • High bias models (too simple) underfit the data and miss important patterns

  • High variance models (too complex) overfit the data and capture noise

  • The goal is to find the right balance for the specific problem

Card 32

Q: What is cross-validation and why is it important?

A: A resampling method that evaluates model performance by dividing the data into multiple subsets and training/testing on different combinations. It provides a more reliable estimate of model performance than a single train-test split, especially with limited data.
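A sketch using scikit-learn's cross_val_score on synthetic regression data (the data-generating line is an assumption made up for the demo):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=50)  # linear + noise

# Five folds: train on four, evaluate (R^2 by default) on the held-out fifth.
scores = cross_val_score(LinearRegression(), X, y, cv=5)
```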

Card 33

Q: What's the difference between L1 and L2 regularization?

A:

  • L1 (Lasso): Adds absolute value of coefficients to loss function, can produce sparse models by driving some coefficients to zero

  • L2 (Ridge): Adds squared value of coefficients to loss function, tends to distribute weight values more evenly

  • Both help prevent overfitting but with different characteristics
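The contrast can be seen on synthetic data where only some features matter; Lasso and Ridge are scikit-learn's L1- and L2-regularized linear regressors, and the alpha value here is an arbitrary choice:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first two features actually influence the target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.5).fit(X, y)  # L1: tends to zero out irrelevant coefficients
ridge = Ridge(alpha=0.5).fit(X, y)  # L2: shrinks all coefficients, rarely to zero
```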