Q: What is NumPy?
A: The primary library for numerical manipulation in Python, especially for matrix operations. It allows for creation and manipulation of arrays and provides high-level mathematical functions.
Q: What is Matplotlib?
A: A library used for data visualization in Python that offers numerous options for creating plots, charts, and other visualizations.
Q: What is Pandas?
A: A powerful tool for data manipulation and analysis, especially with large datasets. It enables converting datasets into data frames for easy manipulation and can read data from various file formats.
Q: What is Scikit-learn?
A: A crucial library for machine learning algorithms and tools in Python. It includes a wide range of ML algorithms, often incorporates recent algorithms developed by academics, and provides tools for various steps in the ML pipeline.
Q: What functions in Pandas help with initial data exploration?
A:
head(): Returns the first five rows of a data frame (by default)
info(): Lists each column with its non-null count and data type
describe(): Generates summary statistics (count, mean, standard deviation, min, quartiles, max) for each numeric attribute
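A minimal sketch of these three calls on a small made-up data frame. The column names and values are hypothetical and stand in for a real dataset.

```python
import pandas as pd

# Hypothetical toy data; a real dataset would usually be loaded from a file.
df = pd.DataFrame({
    "rooms": [3, 5, 2, 4, 6, 3],
    "price": [150.0, 320.0, 95.0, 210.0, 400.0, 160.0],
    "city": ["A", "B", "A", "C", "B", "A"],
})

first_rows = df.head()    # first five rows, returned as a new DataFrame
df.info()                 # prints column names, non-null counts, and dtypes
summary = df.describe()   # summary statistics for the numeric columns only

print(first_rows)
print(summary)
```

Note that describe() skips the non-numeric "city" column by default.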
Q: What is supervised learning?
A: A type of machine learning where the model is trained on labeled data, meaning the data includes both the input features and the desired output (target).
Q: What is regression in machine learning?
A: A type of supervised learning that aims at predicting numerical/continuous values (as opposed to classification, which predicts categorical values).
Q: What are the main steps in the Machine Learning Pipeline?
A:
Data Collection
Understanding the Data
Data Cleaning
Model Selection
Model Training
Model Deployment
Maintenance
Q: What is feature engineering?
A: Creating new features from existing ones to improve model performance. Example: creating "rooms per household" by dividing total rooms by number of households.
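The "rooms per household" example could be sketched in Pandas as follows. The column names (total_rooms, households) and the values are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical housing records; column names are illustrative.
housing = pd.DataFrame({
    "total_rooms": [880, 7099, 1467],
    "households": [126, 1138, 177],
})

# New feature: element-wise ratio of two existing columns.
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
print(housing)
```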
Q: What is a correlation coefficient and what is its range?
A: A measure that quantifies the relationship between variables. Standard (Pearson's) correlation ranges from -1 to +1, where:
Values near +1 indicate strong positive correlation
Values near -1 indicate strong negative correlation
Values near 0 indicate weak or no correlation
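As an illustration, the three cases above can be reproduced with the Pandas corr() method, which computes Pearson's correlation by default. The data below is made up so that each case is exact.

```python
import pandas as pd

# Hypothetical columns constructed to hit each case exactly.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "up": [2, 4, 6, 8, 10],      # rises linearly with x
    "down": [10, 8, 6, 4, 2],    # falls linearly with x
    "flat": [1, -1, 0, -1, 1],   # no linear trend with x
})

corr = df.corr()   # Pearson correlation matrix
print(corr["x"])
```

The "flat" column still follows a clear (symmetric) pattern, which is a reminder that a coefficient near 0 only rules out a linear relationship.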
Q: What is a scatter matrix and what does it show?
A: A grid of plots that serves as a visual counterpart to the correlation matrix. It shows a scatter plot for each pair of numerical variables and, typically on the diagonal, a histogram for each individual variable, which helps identify relationships and anomalies in the data.
Q: What are the main strategies for handling missing values?
A:
Removing records with missing values
Removing the entire column
Replacing missing values with a calculated value (zero, mean, median, etc.)
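The three strategies above might look like this in Pandas. This is a sketch on a hypothetical two-column data frame, and the median is just one possible replacement value.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [50.0, 62.0, np.nan, 58.0],
})

dropped_rows = df.dropna()                 # 1. remove records with missing values
dropped_col = df.drop(columns=["income"])  # 2. remove an entire column
filled = df.fillna(df.median())            # 3. replace with each column's median

print(dropped_rows)
print(dropped_col)
print(filled)
```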
Q: What is the difference between categorical and ordinal data?
A:
Categorical/Nominal data: Data whose categories have no inherent order (e.g., hair color)
Ordinal data: Data with a specific order (e.g., bad, average, good)
Q: What is one-hot encoding and when is it used?
A: A technique for creating binary vectors (new columns) for each category in categorical data. It's used when there are independent categories with no order, to convert categorical variables into a format that works better with machine learning algorithms.
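One way to sketch this is with the Pandas get_dummies function, reusing the hair color example. The column name and values are hypothetical; scikit-learn's OneHotEncoder is a common alternative.

```python
import pandas as pd

# Hypothetical nominal feature with three unordered categories.
df = pd.DataFrame({"hair_color": ["brown", "blond", "black", "brown"]})

# One new 0/1 column per category; the original column is replaced.
encoded = pd.get_dummies(df, columns=["hair_color"], dtype=int)
print(encoded)
```

Each row has exactly one 1 across the new columns, which is where the name "one-hot" comes from.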
Q: What is feature scaling and why is it important?
A: Transforming numerical features to have a similar scale. It's important when input values have very different ranges because it makes it easier to compare features and can speed up the convergence of optimization algorithms.
Q: What is Min-Max Scaling (Normalization) and what's its formula?
A: A scaling technique where values are shifted and scaled to range from 0 to 1. Formula: X′ = (X - X_min) / (X_max - X_min)
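The formula can be applied directly with NumPy. A small sketch on made-up values:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 50.0])   # hypothetical feature values

# X' = (X - X_min) / (X_max - X_min)
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)   # [0.   0.25 0.5  1.  ]
```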
Q: What is Standardization (Z-score) and what's its formula?
A: A scaling technique where values are centered on the mean and scaled to unit variance. Formula: X′ = (X - X_mean) / σ, where σ is the standard deviation. Unlike normalization, the result is not bounded to a specific range, and it is less affected by outliers.
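A short NumPy sketch of the same formula on made-up values. It uses the population standard deviation (NumPy's default), which is an assumption about which σ is meant.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])   # hypothetical values

# X' = (X - X_mean) / sigma; here the mean is 5 and sigma is 2
z = (x - x.mean()) / x.std()
print(z)   # [-1.5 -0.5 -0.5 -0.5  0.   0.   1.   2. ]
```

The result has mean 0 and standard deviation 1, but its values are not confined to the 0 to 1 range.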
Q: Why do we split data into training and testing sets?
A: To evaluate how well our model generalizes to new, unseen data. The training set is used to train the model, while the testing set is used to evaluate its performance.
Q: What is stratified sampling and when should it be used?
A: A technique for grouping data into subgroups (strata) and sampling from each stratum. It should be used when the dataset has imbalanced features that are important for prediction, to ensure representation across all important feature categories.
Q: What is Mean Absolute Error (MAE) and what's its formula?
A: A metric measuring the average of the absolute differences between predictions and actual values. Formula: MAE(X,h) = (1/m) Σ|h(x_i) - y_i|, where m is the number of instances. It corresponds to the Manhattan norm (L1 norm) and is less sensitive to outliers than RMSE.
Q: What is Root Mean Square Error (RMSE) and what's its formula?
A: A metric measuring the square root of the average of the squared differences between predictions and actual values. Formula: RMSE(X,h) = √[(1/m) Σ(h(x_i) - y_i)²], where m is the number of instances. It corresponds to the Euclidean norm (L2 norm). It is more sensitive to outliers than MAE but is generally preferred when outliers are rare.
Q: Compare MAE and RMSE in terms of sensitivity to outliers.
A:
MAE (Mean Absolute Error) is less sensitive to outliers since it uses absolute differences
RMSE (Root Mean Square Error) is more sensitive to outliers because squaring the differences amplifies larger errors
RMSE is generally preferred when outliers are rare, while MAE might be better when the data contains significant outliers
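The comparison can be made concrete with a small NumPy sketch. The helper names mae and rmse and the data are hypothetical; the second set of predictions differs from the first only by one large error.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean of absolute errors."""
    return np.mean(np.abs(y_pred - y_true))

def rmse(y_true, y_pred):
    """Square root of the mean of squared errors."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_close = np.array([11.0, 19.0, 31.0, 39.0])     # every error is 1
y_outlier = np.array([11.0, 19.0, 31.0, 60.0])   # one error is 20

print(mae(y_true, y_close), rmse(y_true, y_close))       # both are 1.0
print(mae(y_true, y_outlier), rmse(y_true, y_outlier))   # MAE 5.75, RMSE about 10.04
```

With uniform errors the two metrics agree; the single large error moves RMSE much further than MAE.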
Q: What is model deployment?
A: The process of putting a trained model into production to make predictions on new data. It can involve retraining the model with new data, and the model selection can depend on the deployment approach.
Q: What is model maintenance?
A: Ongoing work to keep the model performing well over time, which includes addressing data set changes and model drift, and retraining models to maintain accuracy.
Q: What is model drift and why is it important?
A: The degradation of model performance due to changes in the data distribution over time. It's important because it affects the model's ability to make accurate predictions, requiring monitoring and periodic retraining.
Q: What is Jupyter Notebook?
A: An environment for writing Python code and experimenting with data analysis and machine learning that allows for combining code, annotations, and outputs in a single document. It operates through a web browser and supports multiple kernels.
Q: What is SimpleImputer in scikit-learn used for?
A: A tool provided by scikit-learn to replace missing values in a dataset using strategies like mean, median, most frequent value, or constant.
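A minimal sketch of SimpleImputer with the median strategy, on a made-up array with one missing value per column:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix; np.nan marks the missing entries.
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [8.0, 60.0],
])

imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X)   # learns column medians, then fills gaps

print(imputer.statistics_)   # learned medians per column
print(X_filled)
```

Because the fitted imputer stores the learned values, the same medians can later be applied to new data with transform().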
Q: What scikit-learn function is used for splitting data into training and testing sets?
A: The train_test_split function, typically used with a test_size parameter (e.g., 0.2 for an 80%-20% split).
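A short sketch of an 80%-20% split on made-up arrays. The random_state value is an arbitrary choice that makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 hypothetical samples, 2 features
y = np.arange(10)                  # hypothetical targets

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)   # (8, 2) (2, 2)
```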
Q: What scikit-learn class is used for stratified sampling?
A: The StratifiedShuffleSplit class, which preserves the percentage of samples for each class in both training and test sets.
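A sketch of how the class might be used on made-up, imbalanced labels (80% class 0, 20% class 1). The sizes and random_state are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.arange(20).reshape(20, 1)    # 20 hypothetical samples
y = np.array([0] * 16 + [1] * 4)    # imbalanced labels: 80% / 20%

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(X, y))

# Both subsets keep the 80% / 20% class ratio.
print(np.bincount(y[train_idx]))   # [12  3]
print(np.bincount(y[test_idx]))    # [4 1]
```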
Q: What scikit-learn classes are used for feature scaling?
A:
MinMaxScaler: For normalization (scaling to a 0-1 range)
StandardScaler: For standardization (centering around the mean with unit variance)
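Both classes share the same fit/transform interface. A minimal sketch on a made-up matrix whose two columns have very different ranges:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales.
X = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 500.0],
])

X_minmax = MinMaxScaler().fit_transform(X)   # each column mapped to [0, 1]
X_std = StandardScaler().fit_transform(X)    # each column: mean 0, unit variance

print(X_minmax)
print(X_std)
```

In practice a scaler is usually fitted on the training set only and then applied to the test set with transform(), so that no information from the test data leaks into training.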
Q: What is the trade-off between bias and variance in model selection?
A: A fundamental challenge in ML where:
High bias models (too simple) underfit the data and miss important patterns
High variance models (too complex) overfit the data and capture noise
The goal is to find the right balance for the specific problem
Q: What is cross-validation and why is it important?
A: A resampling method that evaluates model performance by dividing the data into multiple subsets and training/testing on different combinations. It provides a more reliable estimate of model performance than a single train-test split, especially with limited data.
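As a sketch, 5-fold cross-validation could be run with scikit-learn's cross_val_score. The synthetic dataset, the choice of LinearRegression, and the number of folds are all assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for a real dataset.
X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)

# scikit-learn scorers follow "higher is better", so RMSE is returned negated.
scores = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_root_mean_squared_error"
)
rmse_per_fold = -scores

print(rmse_per_fold)         # one RMSE per held-out fold
print(rmse_per_fold.mean())  # average across the five folds
```

Each of the five folds is held out once for evaluation while the model is trained on the other four, so every sample is used for both training and testing.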
Q: What's the difference between L1 and L2 regularization?
A:
L1 (Lasso): Adds the sum of the absolute values of the coefficients to the loss function; can produce sparse models by driving some coefficients exactly to zero
L2 (Ridge): Adds the sum of the squared coefficients to the loss function; shrinks all coefficients toward zero without eliminating them, which tends to distribute weight more evenly
Both help prevent overfitting but with different characteristics
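The difference in sparsity can be illustrated with scikit-learn's Lasso and Ridge. This is a sketch on synthetic data in which only the first two of five features influence the target; the alpha value is an arbitrary assumption.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: y depends on features 0 and 1 only; features 2-4 are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.5).fit(X, y)   # L2 penalty

print("Lasso coefficients:", lasso.coef_)
print("Ridge coefficients:", ridge.coef_)
```

On data like this, Lasso is expected to set the coefficients of the irrelevant features exactly to zero, while Ridge keeps small non-zero values for them.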