1/66
ML
Name | Mastery | Learn | Test | Matching | Spaced |
|---|
No study sessions yet.
What is the fundamental concept of machine learning?
Teaching computers to perform tasks using data rather than explicit programming
In machine learning, a model is defined with parameters that are optimized using what?
Training data or past experience
What are the two general purposes a machine learning model can serve?
It can be predictive (making future predictions) or descriptive (gaining knowledge from data)
According to the source, what is the primary role of machine learning in the modern data landscape?
To enable computers to learn from data to aid knowledge discovery, transforming raw data into insights through pattern recognition.
What are the three main branches of machine learning identified in the lecture?
Supervised learning, Unsupervised learning, and Reinforcement learning.
What is the core principle of supervised learning?
It involves learning from labeled data to predict outputs for new inputs.
In supervised learning, given a dataset D = {Xi, Yi}, what is the goal?
To learn a function F that maps inputs to outputs, represented as F: X → Y
What is the primary use case for supervised learning in industry?
To automate manual labor by learning a model from a manually annotated subset of data to annotate the rest.
Supervised learning is divided into two main types
Classification and Regression
What type of output does a classification model predict?
Discrete outputs or categories.
What is a key requirement for training a supervised classification model?
Historical data with both features and corresponding labels.
What is an example of a classification task in the financial industry?
Credit scoring, which determines if customers are low-risk based on attributes like income and savings.
In the context of supervised learning, what task involves identifying diseases from patient information?
Medical diagnosis (a classification task).
What is an example of a classification task involving images of text?
Optical character recognition (recognizing handwritten digits or characters).
What type of output does a regression model predict?
Continuous outputs or numerical values.
What is an example of a regression task related to vehicles?
Predicting car prices based on mileage and other features.
In navigation systems, predicting the steering angle is an example of what type of supervised learning?
Regression.
What is the primary goal of unsupervised learning?
To discover patterns, structures, or groupings in unlabeled data.
The process of grouping similar items together without predefined labels, such as in customer segmentation, is known as _.
Clustering
What is a common application of unsupervised learning in cybersecurity and finance?
Outlier detection, used to identify fraud or security threats.
How can unsupervised learning be used for data storage efficiency?
Through image compression, by grouping similar colors to reduce storage requirements.
What is the objective of reinforcement learning?
To learn optimal action sequences in an environment to maximize cumulative rewards.
What are the three core components in a reinforcement learning problem?
Environment (e), actions (a), and rewards (r).
Reinforcement learning is particularly suited for scenarios requiring _ under uncertainty.
sequential decision-making
What is the first stage of the typical machine learning workflow?
Data ingestion: collecting raw data from various sources.
What is the second stage of the machine learning workflow, occurring after data ingestion?
Data cleansing / transformation: preprocessing and cleaning data.
After data is cleaned and transformed, what is the next step in the ML workflow?
Model training: building the model on the training data.
How is a model's performance evaluated in the machine learning workflow?
During the model testing stage, using the test data.
What is the final stage of the initial ML workflow before continuous improvement begins?
Model deployment: implementing the model in production.
What mechanism ensures continuous improvement in a deployed machine learning model?
A feedback loop based on the model's ongoing performance.
How are images typically represented for processing by a computer in machine learning?
As arrays of RGB values for each pixel.
How must text data be prepared before it can be used in a machine learning model?
It must be converted into a numerical format through tokenization and encoding.
What is the purpose of the Standard scaler (Z-scores) in data preprocessing?
To transform data to have a mean of 0 and a standard deviation of 1.
What is the formula for the Standard scaler transformation?
It is calculated as (X - μ) / σ, where μ is the mean and σ is the standard deviation.
Under what condition does the Standard scaler perform best?
It works well for data that is not skewed (i.e., closely Gaussian distributed).
What makes the Robust scaler more effective than the Standard scaler for skewed data with outliers?
It uses the median instead of the mean and the interquartile range instead of the standard deviation.
Which scaling technique shifts data to a specific interval, typically between 0 and 1?
Min-max scaler.
What is the formula for the Min-max scaler?
It is calculated as (X - X{min}) / (X{max} - X_{min}).
How does the Normalizer scaler differ from other scalers like Standard or Min-max?
It works row-wise rather than column-wise, rescaling each row's norm to 1.
When is the Normalizer scaler particularly useful?
When the direction of the data matters more than the magnitude, such as with histograms.
What is the goal of using univariate mathematical transformations like log or Box-Cox?
To adjust feature scales and make their distributions more closely Gaussian (bell-shaped).
For what type of data is a logarithmic transformation particularly useful?
Counting data that has many small values and a few large outliers.
If data contains zeros, how should a log transformation be applied to avoid errors?
By using the formula log(X + 1), since the logarithm is undefined at zero.
What do the Box-Cox and Yeo-Johnson transformations do automatically?
They estimate parameters to minimize skewness and stabilize variance, converting distributions to be near-Gaussian.
What is the process of separating continuous feature values into a number of categories called?
Binning or discretization.
How does binning affect linear models?
It increases their flexibility by allowing the model to learn different predictions for each bin.
Why does binning typically have no beneficial effect for tree-based models?
Because tree-based models can already learn to split the data anywhere along a feature's range.
What is the primary reason for adding interaction and polynomial features to a dataset?
To enrich the feature representation, especially for linear models.
Why is it crucial to split a dataset into training and testing sets?
To avoid overfitting and to properly evaluate how well the model generalizes to new, unseen data.
A model that is too simple and fails to capture the underlying patterns in the data is said to be _.
Underfitting
What is overfitting?
When a model is too complex and memorizes the training data, leading to poor generalization on new data.
What is the 'sweet spot' in model training?
The optimal model complexity that balances high training accuracy with good generalization performance.
For datasets with multiple samples per subject, what splitting method should be used to prevent data leakage?
Subject-wise splitting, ensuring all data from one subject is in either the training or test set, but not both.
What is the purpose of cross-validation?
To evaluate a model's ability to predict new data and to detect issues like overfitting or selection bias.
In ___ , the dataset is divided into subsets, and each subset is used as the test set once.
K-fold cross-validation
What is Leave-one-out cross-validation?
An extreme form of K-fold cross-validation where K is equal to the total number of samples in the dataset.
What is the purpose of using DummyClassifier and DummyRegression models?
They serve as simple baselines to compare against more complex models, using strategies like predicting the most frequent class or the mean.
What is a machine learning pipeline?
An integrated workflow that executes a sequence of tasks, such as scaling, training, and prediction.
What is a key benefit of using a machine learning pipeline for preprocessing?
It ensures consistent transformations are applied to both training and testing data, preventing data leakage.
In ML terminology, what is an 'instance'?
A single data point or example in a dataset.
What are 'features' or 'attributes' in machine learning?
The descriptive characteristics or properties of an instance.
The target variable or output that a model tries to predict is known as the _.
Label or class
What is a 'feature vector'?
A collection of all the feature values for a single instance.
Term: Model
An equation or system that links feature values to predicted target values.
What are 'score functions' or 'fit statistics' used for?
To measure how well a model fits the data.
What is the goal of feature selection?
To reduce the number of predictors by selecting only the most important ones, which is a form of dimensionality reduction.
What is feature extraction?
Transforming data into a new, often lower-dimensional feature space through mathematical operations like PCA.