What is machine learning?
Teaching a computer to do a task using “data” (ex. examples of what we want it to do)
ex. trying to distinguish between a greyhound vs labrador
Create Dataset: Collect some examples of Greyhounds and Labradors
Collect photos of dogs in each breed
Describe each dog to the computer
Height of specific dog *
Weight
Color
*machine learning doesn’t predict randomness; instead it learns from patterns in the data and from experience
exposure to more data increases learning quality → it is an optimization problem (trying to minimize error)
How do you increase the prediction quality in machine learning - distinguishing between a labrador vs greyhound?
Adding more features increases the prediction quality
too many features = exhausts resources
too few features = more errors
feature value = ex. 68 years, male
target value = ex. has diabetes
feature = ex. age
Decision boundary - cut off threshold
Why did machine learning become more relevant?
The evolution of digital data → huge expansion
used to have structured data → highly organized, excel spreadsheets..
now we have unstructured data → emails, text and video files, pictures, satellite data…
Volume of data collected grows daily
More than 90% of the data in the world was created within the past two years!!!
Data is cheap and abundant but ... knowledge is expensive and scarce
To make sense of all the unstructured data: we need knowledge discovery
Machine learning: computers learn from data to aid knowledge discovery
What are the 3 main branches of machine learning?
supervised learning: have data and labels, model predicts a label for new data
unsupervised learning: no labels given (ex. a bunch of cat pics but no labels); patterns are found without any prior labeling
reinforcement learning: reward system, maximize reward, system learns by doing
Describe Supervised learning
Given a set of data and labels, a model learns to predict a label for new data
• Given D = {Xi, Yi}, learn a model (or function) F: Xi → Yi
Often used to automate manual labor
ex. you might annotate part of a dataset manually, then learn a machine learning model from these annotations, and use the model to annotate the rest of your data
examples:
Given a satellite image, what is the terrain in the image?
Xi = pixels (image regions), Yi = terrain type
Given some test results from a patient, will the patient have diabetes?
Xi = test results, Yi = diabetes/no diabetes
Describe Unsupervised learning
Discover patterns in data, no labels
Given D = {Xi} group the data into Y classes using a model (or function)
F: Xi → Yj
examples:
Discovering trending topics on Twitter or in the news
Grouping data into clusters for easier analysis
Outlier detection (ex. Fraud detection and security systems)
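A minimal sketch of unsupervised learning in scikit-learn, using K-means clustering as one possible grouping method (the data and number of clusters are made up for illustration):

```python
from sklearn.cluster import KMeans
import numpy as np

# Unlabeled data: each row is a sample, each column a feature (no Y given)
X = np.array([[1.0, 2.0], [1.2, 1.8], [8.0, 8.5], [8.3, 7.9]])

# Group the data into 2 clusters; the learned model plays the role of F: Xi -> Yj
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels)  # cluster index assigned to each sample, ex. [0 0 1 1]
```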
Describe Reinforcement learning
Reasoning under uncertainty to make optimal decisions; reward mechanism; an optimization problem (maximizing reward / minimizing error)
how agents should take actions in an environment to maximize some reward.
Given D = {environment (e), actions (a), rewards (r)} learn a policy and utility functions:
policy: F1: {e,r} → a
utility: F2: {a,e} → r
Describe Machine Learning workflow
get the data, choose an algorithm, clean the data (include and exclude data), split the labeled data into a training set and a test set, learn the model on the training set, and test the learned relationship on unseen data (the test set)
Pre-processing data is related to which model you use
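A minimal sketch of this workflow in scikit-learn (the dataset, the classifier choice and the split ratio are illustrative assumptions, not prescribed by the notes):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Get the data, then split the labeled data into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Choose an algorithm and learn the model on the training set
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Test the learned relationship on unseen data
print(model.score(X_test, y_test))  # accuracy on the test set
```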
What is a model and score functions/score metrics?
A model: an equation that links the values of some features to the predicted value of the target variable; finding the equation (and coefficients in it) is called ‘building a model’ (see also ‘fitting a model’).
Score functions/Fit statistics/Score metrics: measures of how well the model fits the data.
What is feature selection and extraction?
Feature selection: reducing the number of predictors by selecting the important ones (dimensionality reduction).
Feature extraction: transform the data onto a new feature space by means of a mathematical operation (ex. PCA).
decreases number of features
What are the 2 main types of supervised learning?
classification: discrete output (labels)
ex. color, yes/no
examine the stats of two football teams and predict which team will win tomorrow's match, a discrete task
regression: continuous output (numbers)
ex. temp, age, distance
predict numerical value: predict the number of microsoft shares that will be traded tomorrow
binary classification: used to predict only 2 discrete-valued outputs such as 0 and 1
What are the DummyClassifier and DummyRegressor?
Do not generate any insight about the data
Serve as a simple baseline to compare against other, more complex classifiers/regressors
see how much better your model is compared to randomness
DummyClassifier
classifies the given data using only simple strategies; most-frequent, uniform, constant...
if a simple strategy already gets as close to the actual labels as your model does, the DummyClassifier beats you
DummyRegressor
makes predictions using simple strategies; mean, median ...
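A minimal sketch comparing a dummy baseline against a real model (the dataset and strategy are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: always predict the most frequent class; no insight about the data
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
print("baseline accuracy:", dummy.score(X_test, y_test))

# Any real model should clearly beat this baseline to be worth using
# (DummyRegressor works the same way with strategy="mean" or "median")
```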
Describe Images used for machine learning data
Computers work with numbers
Images are arrays of numbers (RGB values for each pixel)
every pixel has a color depth (ex. a grayscale intensity level); the computer understands only numerical values
number of pixels gives you number of features in dataset
Describe Text used for machine learning data
Words / letters need to be converted into a format computers can understand
have unstructured sentences, split by spaces (tokenize)…
storing the full matrix of 0s and 1s uses too much memory → instead store only the locations of the nonzero entries (a sparse, smaller representation)
What happens with scaling data? - preprocessing
With few exceptions, machine learning algorithms don’t perform well when the input numerical attributes have very different scales
when the scales of features differ, the feature with the largest scale could have too much power in determining the outcome → prevent this by scaling the data
What is the standard scaler?
z-scores (standard scores): mean of 0 and standard deviation of 1
common method in data normalization (good for non-skewed data)
*outliers have more weight when determining the mean
What is the robust scaler?
Same as Standard Scaler, but with median instead of mean and interquartile range instead of standard deviation.
Better for skewed data
Deals better with outliers
*tighter cluster around (0,0), with axes more stable, outliers less extreme
What is the MinMax scaler?
Shifts data to an interval set by xmin and xmax
*between 0 and 1 interval
What is the normalizer?
Does not work by feature (column) but by row
Each row of data is rescaled so that its norm (magnitude) becomes 1, but the relative proportions of features within that row are preserved.
Compute the norm of the vector (square root of the sum of the squared elements)
Divide each element by the norm
Used only when the direction of data matters
Helpful for histograms - histograms often compare feature distributions across samples; without normalization, histograms could be dominated by the largest-magnitude samples
*scales things per row, makes the length of every vector 1, good for histograms
What are the 4 types of scaled data?
standard → more scattered
robust → tighter
minmax → between 0 and 1
normalizer → height of 1
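A minimal sketch showing the four scalers side by side on made-up data (values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler, Normalizer

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 10000.0]])  # second feature has a much larger scale

print(StandardScaler().fit_transform(X))  # per column: mean 0, std 1
print(RobustScaler().fit_transform(X))    # per column: median 0, scaled by IQR (robust to outliers)
print(MinMaxScaler().fit_transform(X))    # per column: squeezed into [0, 1]
print(Normalizer().fit_transform(X))      # per row: each sample rescaled to unit norm (length 1)
```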
What are Univariate Transformations?
Examples of univariate transformations: logarithmic, geometric, power ...
Most ML models perform best with Gaussian distributed data (bell curve)
Methods to transform data to Gaussian include Box-Cox transform and Yeo-Johnson transform
Parameters can be automatically estimated so that skewness is minimized and variance is stabilized
How do you transform to the log scale? - Univariate Transformations
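A minimal sketch of a log-scale transform, plus the automatic Box-Cox / Yeo-Johnson option mentioned above (the data is made up, and np.log1p is one common choice rather than the specific one from the lecture):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

X = np.array([[1.0], [10.0], [100.0], [1000.0]])  # heavily skewed feature

X_log = np.log1p(X)  # log(1 + x); safe when zeros are present

# Yeo-Johnson (default) also handles negative values; method="box-cox" needs strictly positive data
X_gauss = PowerTransformer(method="yeo-johnson").fit_transform(X)
```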
What is Binning?
Separate feature values into n categories (ex. equally spaced over the range of values)
Replace all values within a category by a single value, ex. mean
Effective for models with few parameters, such as regression, but not effective for models with many parameters, such as decision trees
*way of preprocessing the data, but not common practice
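A minimal sketch of binning with scikit-learn's KBinsDiscretizer (the number of bins and strategy are illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.random.RandomState(0).uniform(0, 10, size=(20, 1))

# Separate the feature range into 5 equally spaced categories
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
X_binned = binner.fit_transform(X)  # each value replaced by its bin index
```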
What does it mean to measure classification success?
How “predictive” are the models we have learnt?
New data is probably not exactly the same as the training data
What happens if we overfit our data?
*accuracy measures how well your model generalizes to unseen data; as a model gets more complex, training accuracy moves closer to 1 while test accuracy can start to drop
in the figure: green = test set, blue = training set
overfitting: model is tailored too closely (100%) to the training set and fails to generalize
underfitting: model is too simple to capture the patterns, even in the training set
How do you avoid overfitting?
Build a classifier using the training set and evaluate it using the test set
use more data points for training set, make it more representative of the data universe
Subject-wise splitting
you want all the data belonging to one patient to end up entirely in either the training set or the test set
leakage = data from the same patient appears in both the training and the test set
What is cross validation?
To evaluate (test) your model’s ability to predict new data
Detect overfitting or selection bias
What is a cross validation technique?
K-fold cross-validation
leave one out (K-fold cross-validation to the extreme)
*split the data into k folds and use each fold once as the test set (the rest as training), so all data points are used by the model; with 4 folds you calculate accuracy 4 times and average…
*important for small datasets
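A minimal sketch of k-fold cross-validation with scikit-learn (4 folds to match the note above; the dataset and classifier are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Split the data into 4 folds; each fold is used once as the test set
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=4)
print(scores, scores.mean())  # 4 accuracy values and their average

# Leave-one-out = k-fold taken to the extreme (cv equal to the number of samples)
```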
What are Machine learning pipelines?
Workflows to execute a sequence of tasks
Data normalization (scaling)
Imputation of missing values
Dimensionality Reduction
Classification
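A minimal sketch of such a pipeline in scikit-learn (the chosen steps are illustrative; PCA stands in for dimensionality reduction):

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),   # imputation of missing values
    ("scale", StandardScaler()),                  # data normalization (scaling)
    ("reduce", PCA(n_components=2)),              # dimensionality reduction
    ("classify", KNeighborsClassifier()),         # classification
])
# pipe.fit(X_train, y_train); pipe.score(X_test, y_test)
```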
What is clustering?
Making a new classification system for Pokémon characters according to similarities in their characteristics such as strength, speed and defensiveness → these are the features; unsupervised learning, no labels
Lecture 2 - Feature Engineering
What is Missing Values Imputation?
In real world datasets, missing input values are very common
No standard encoding (blank, 0, “NA”, NaN, Null, ...)
Imputation: replacing missing value with estimate for that value
Mean / median
KNN
Model-driven
Iterative
Describe the 3 approaches for missing values imputation
Mean: Taking the mean gives you an unrealistic data set
in image, rounded dots are original feature values, colors indicate labels
you can take another approach → KNN
KNN:
more realistic because all the imputed values (squares) lie in the neighborhood of the actual data points; the estimate is calculated from the points surrounding the missing one
k = number of neighbors you’re taking into account
*can also drop the missing value
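A minimal sketch of mean and KNN imputation with scikit-learn (the array and k are made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0], [8.0, np.nan]])

X_mean = SimpleImputer(strategy="mean").fit_transform(X)  # replace NaN with the column mean
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)        # replace NaN using the 2 nearest rows
```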
What are categorical variables?
Data often has categorical (or discrete) features → non-numerical, no order (gender, city…)
Remember measurement levels: categorical – ordinal – interval – ratio.
Often necessary to represent categorical features as numbers.
One Hot encoding
Count-based encoding
What is One Hot Encoding?
gives every category its own binary vector → ex. 3 colors red, green, blue get the 3 vectors 1,0,0 / 0,1,0 / 0,0,1
*the values don’t carry any importance; it’s more about labeling and organizing
important that the math used by machine learning models is not affected by the encoding → impossible to use 1,2,3, ...
Adding one feature for each category (feature encodes whether a sample belongs to this category or not)
→ all colours are equally distant from each other
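A minimal sketch of one-hot encoding the red/green/blue example with scikit-learn:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([["red"], ["green"], ["blue"], ["green"]])

encoder = OneHotEncoder(sparse_output=False)  # on older scikit-learn versions the parameter is sparse=False
print(encoder.fit_transform(colors))
# each color becomes its own 0/1 column, ex. red -> [0, 0, 1] depending on category order
```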
What is Count-Based Encoding?
For high cardinality categorical features → ex. countries
Instead of 50 one-hot variables, replace label with the value of a variable aggregated over that label.
For regression: “people in this state have an average response of y”
Binary classification: “people in this state have likelihood p for class 1”
Multiclass: One feature per class: probability distribution
*choose the average value of the concept you’re interested in, ex. for every city put the average temp
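A minimal sketch of count-based (target) encoding with pandas, following the "average value per city" idea above; the data and column names are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Ams", "Ams", "Rdam", "Rdam", "Utrecht"],
    "temp": [10.0, 12.0, 11.0, 13.0, 9.0],   # the value we are interested in
})

# Replace each city label with the average temperature observed for that city
city_means = df.groupby("city")["temp"].mean()
df["city_encoded"] = df["city"].map(city_means)
```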
Describe how digital images are processed
The values are all discrete and integers.
Can be considered as a large array of discrete dots
Each dot has a brightness associated with it.
These dots are called picture elements - pixels
images consist of pixels, each pixel has 3 colors, each color has an intensity
every pixel is a dot, every dot has either red, green, blue
every color has intensity → has 3 values
Describe Arrays and Images
Images are represented as matrices (e.g. numpy arrays)
Can be written as a function f(x,y)
Types of images: Binary Images, Grayscale Images and Color Images
Describe Greyscale Images
Each pixel is a shade of gray
Normally from 0 (black) to 255 (white). Each pixel can be represented by eight bits, or exactly one byte.
Other grayscale ranges are used, but generally they are a power of 2 (ex. 2² = 4, 2⁶ = 64)
*only have one intensity value
Describe Multi-channel Images
Such an image is a stack of multiple matrices representing the multiple channel values for each pixel
Ex. RGB color is described by the amount of red, green and blue in it
*3 ranges, 3 values for every pixel, intensities for each color
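A minimal sketch of an RGB image as a numpy array (the shape and values are made up):

```python
import numpy as np

# A 4x4 RGB image: height x width x 3 channels, intensities 0-255
image = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)

print(image.shape)              # (4, 4, 3): one matrix per channel, stacked
print(image[0, 0])              # the [R, G, B] intensities of the top-left pixel
grayscale = image.mean(axis=2)  # collapsing the channels gives a single-intensity (grayscale) image
```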
What are measures for segmentation for machine learning with images?
A segmentation result can be measured if the ground truth is known
Empirical Measures:
Accuracy, Precision and Recall
F-score, Jaccard Index
What is accuracy and what are problems with it?
Accuracy: sum of the diagonal over everything → (TP+TN) / (TP+TN+FP+FN)
Problems with accuracy:
Imbalanced classes lead to hard-to-interpret accuracy
positive = there, negative = not there
*want to minimize false positives in SPAM email case for example
*false negative = you miss smth, detrimental in healthcare diagnosis instances
Precision, Recall, F-score
precision = TP / (TP+FP): inversely related to false positives → use it when you want to minimize FP
recall = TP / (TP+FN): inversely related to false negatives → use it when you want to minimize FN
*use these (and the F-score) when the data set isn’t balanced
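A minimal sketch computing these measures with scikit-learn on made-up predictions:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]   # imbalanced: few positives
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))   # rows = true class, columns = predicted class
print(accuracy_score(y_true, y_pred))     # can look good even for a useless model on imbalanced data
print(precision_score(y_true, y_pred))    # TP / (TP + FP): penalizes false positives
print(recall_score(y_true, y_pred))       # TP / (TP + FN): penalizes false negatives
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall
```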
Describe how machine learning works with text data
Most Machine Learning algorithms prefer to work with numbers
So far:
Fixed number of features
Continuous
Categorical
Working with Text Data
no pre-defined features
Need to create fixed-length descriptions
*super unstructured and abundant
Features from text: Bag of words
split into parts: tokenizer
build a vocabulary with all the words
create a dictionary
represent into numerical form
1 = word is there
0 = word is not there
length of array = how many words you have in the document
*more 1’s means more words present
Describe Text Data Preprocessing
Tokenization — convert sentences to words
Removing unnecessary punctuation, tags
Removing stop words— frequent words such as ”the”, ”is”, etc. that have low semantic content
Stemming — words are reduced to a root by removing inflection through dropping unnecessary characters, usually a suffix.
Lemmatization —Another approach to remove inflection by determining the part of speech and utilizing detailed database of the language
What is Tokenization?
Process of breaking a stream of textual content up into words, terms, symbols, or some other meaningful elements called tokens
The list of tokens becomes the input for further processing such as parsing or text mining
In data security, tokenization can also mean swapping out sensitive data
Ex. replacing payment card or bank account numbers with a randomized number in the same format
What is Stemming and Lemmatization?
Stemming — words are reduced to a root by removing inflection through dropping unnecessary characters, usually a suffix.
The stemmed form of studies is: studi
The stemmed form of studying is: study
Lemmatization —Another approach to remove inflection by determining the part of speech and utilizing detailed database of the language
The lemmatized form of studies is: study
The lemmatized form of studying is: study
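A minimal sketch reproducing the studies/studying example with NLTK (assumes the nltk package and its WordNet data are installed; exact outputs can vary with the NLTK version):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# nltk.download("wordnet")  # needed once for the lemmatizer's dictionary

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"), stemmer.stem("studying"))                            # crude suffix stripping
print(lemmatizer.lemmatize("studies"), lemmatizer.lemmatize("studying", pos="v"))   # dictionary-based, uses part of speech
```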
How do you Restrict the Vocabulary?
Removing unnecessary punctuation, tags
Removing stop words— frequent words such as ”the”, ”is”, etc. that have low semantic content
Removing infrequent words
Words that appear only once or twice might not be helpful
Restrict vocabulary size to only most frequent words (for less features)
What is Bag of Words?
Most common technique to numerically represent text is Bag of Words.
Represents each sentence or document as a vector with a value for each word in the vocabulary.
Binary: word present or absent in the document
Count: how often the word appears in the document
Popular approach: Term Frequency x Inverse Document Frequency (TF-IDF)
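A minimal sketch of a bag-of-words representation with scikit-learn's CountVectorizer (the sentences are made up; the same parameters also cover restricting the vocabulary: stop_words, min_df, max_features):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]

# binary=True gives present/absent; binary=False (default) gives counts
vectorizer = CountVectorizer(stop_words="english", min_df=1, max_features=1000, binary=False)
X = vectorizer.fit_transform(docs)            # sparse matrix: documents x vocabulary
print(vectorizer.get_feature_names_out())     # the learned vocabulary
print(X.toarray())                            # one count vector per document
```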
What is Term Frequency-Inverse Document Frequency (TF-IDF)?
Term Frequency (TF) = Number of times term t appears in a document/Number of terms in the document
Inverse Document Frequency (IDF) = log(N/n), where N is the number of documents and n is the number of documents the term t appears in. The IDF of a rare word is high, whereas the IDF of a frequent word is likely to be low, which has the effect of highlighting distinctive words.
We calculate TF-IDF value of a term as = TF * IDF
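A minimal sketch of TF-IDF with scikit-learn (note that TfidfVectorizer uses a smoothed, normalized variant of the log(N/n) formula above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)        # TF * IDF value for each term in each document
print(tfidf.get_feature_names_out())
print(X.toarray().round(2))          # distinctive words get higher weights than shared ones
```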
Describe Feature Selection
Why select features?
Avoid overfitting
Faster prediction and training
Less storage for model and dataset
Strategies
Univariate statistics
Model-based selection
Iterative selection
What are Univariate Statistics?
Look at each feature individually
Features will be removed if they do not have a significant relationship with the target
Features that are significant only in combination with another feature (interaction) will be removed.
Selecting features with highest confidence is related to ANOVA (from statistics)
Pick statistic, check p-values!
f_regression, f_classif, chi2 in scikit-learn
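A minimal sketch of univariate feature selection with SelectKBest (the dataset, scoring function and k are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

# Keep the 10 features with the strongest individual relationship to the target
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
print(X.shape, "->", X_selected.shape)
print(selector.pvalues_.round(4))  # p-value per feature
```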
What is Model-Based Feature Selection?
Get best fit for a particular model
Ideally: exhaustive search over all possible combinations
Exhaustive is infeasible (and has multiple testing issues)
Use heuristics in practice
What is Model based (single fit)?
Build a model, select features most important to model.
Lasso, other linear models, tree-based models
Multivariate – linear models assume linear relation
What is iterative Model-Based Selection?
Forwards: Start with single feature, find most important feature, add, iterate
Backwards: Fit model, find least important feature, remove, iterate
Computationally expensive
RFE : Recursive feature elimination and selection
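A minimal sketch of recursive feature elimination (RFE); the estimator and the target number of features are illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Repeatedly fit the model and drop the least important feature until 5 remain
rfe = RFE(estimator=LogisticRegression(max_iter=5000), n_features_to_select=5)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # 1 = selected; higher = eliminated earlier
```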
LECTURE 3 - KNN
What is k-NN? (k-nearest neighbors)
way of predicting the label of a new data point by looking at the labels of the data points closest to it
k-NN can be used for both regression and classification problems → need to label data points and predict the label, you look at the nearest neighbors to determine the target value
algorithm depends on distance
k = hyperparameter (choose the value)
as k decreases, the decision boundaries in the dataset become more complex → using only the closest neighbor might imply overfitting
k of 5 uses a larger group of neighbors to make a prediction
downsides: requires a lot of calculations at prediction time, but doesn’t require a training process
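A minimal sketch of a k-NN classifier in scikit-learn (the dataset and k = 5 are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k is a hyperparameter: the number of labeled neighbors considered for each prediction
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)          # "training" only stores the data; the work happens at prediction time
print(knn.score(X_test, y_test))
```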
What are the unknown data points? (1-NN)
What is the k-NN classifier?
The hyperparameter k represents the number of labeled neighbours to consider
Test points are assigned the majority label of the k nearest neighbours
Special cases:
k = N: since all datapoints are considered, the predicted label for a test point will always be the majority label of all datapoints. Equivalent to a majority classifier.
Ties: in case of a tie between predicted labels, there are different possibilities. The most common one is random selection from the tied labels
*an even k can produce ties, which creates ambiguity in the process
What is the nearest-neighbor? (3-NN)
taking more neighbors into account can change the predicted value
What is K-Nearest Neighbours Hypothesis Space?
two-dimensional feature space split into regions whose centroids are the data points
concept seen in nature → skin of giraffes
What is the influence of k on the decision boundary?
with a smaller k, you only take into account the neighbors that are closest
increasing k means you have a smoother decision boundary
*large k = less complex model bcs you take a lot of data points into decision which makes the decision boundary smoother
What is the label (class) of a point on the decision boundary?
it’s ambiguous
you can’t make a decision: the point is equally close to both classes
so there is no unique closest neighbor
Describe weights in k-NN
weights help to make machine learning algorithms smoother
having 2 close neighbors + 1 far away creates a risk → fix it by giving more weight to the neighbors that are closer
can make weights uniform (all have equal importance), use distance, make weights different → flexibility of k-NN
Extension of the basic algorithm: not all neighbors get an equal vote
Distance-weighting: each neighbor has a weight which is based on its distance to the data point to be classified
Inverse distance weighting – each point has a weight equal to the inverse of its distance to the point to be classified (neighboring points have a higher vote)
Inverse of the square of the distance
Kernel functions (Gaussian kernel, tricube kernel)
If we change the distance function, the results will change.
Implication: with distance weighting, k = n is no longer equivalent to a majority-based classifier.
How do you compute distance in k-NN?
Different ways to define the distance function
Euclidean distance (straight line)
Manhattan distance (distance between projections on the axis)
Difference between Euclidean and Manhattan distance
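A minimal sketch computing the two distances with numpy on two made-up points:

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # straight-line distance: 5.0
manhattan = np.sum(np.abs(a - b))          # sum of axis-wise differences: 7.0
```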
How does k determine model complexity?
The model in k-NN is the decision boundary that separates the classes (In regression, the model is the line that fits the data)
Smaller k leads to more complex decision boundaries
k too low → danger of overfitting, high complexity
k too high → danger of underfitting, low complexity
*doing well on unseen data points = generalization
What is the bias variance trade off?
Variance: how sensitive the model is to small changes in the training data.
A high-variance model will give very different decision boundaries if you slightly change the dataset.
In kNN:
Small k (e.g. k = 1): the decision boundary can shift a lot if just one training point changes → high variance.
Large k: more stable, since predictions average over many neighbors → low variance.
Bias: how far the model’s average prediction is from the true underlying relationship.
A high-bias model makes strong assumptions and might miss important patterns.
In kNN:
Small k: very flexible → low bias, because it can fit even complicated shapes.
Large k: very smooth boundaries → high bias, because it might oversimplify (e.g. blur two nearby classes together).
How do you determine model complexity?
Depends on complexity of the separation between the classes
Start with the simplest model (large k in k-NN), and increase complexity (smaller k)
-trying to see generalization
-overfitting decreases with a larger k value
How do you choose k?
Typically odd for an even number of classes (ex. 1, 3, 5, 7..)
As you decrease k, accuracy might increase, but so does complexity
In other words, a small value of k is likely to lead to overfitting (fitting “noise”)
A rule of thumb used by some data-miners: 𝑘 = sqrt(𝑛)
What 3 sets is your data divided into to tune hyperparameters?
training
validation
test
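A minimal sketch of tuning k with a grid search, where cross-validation on the training set plays the role of the validation set and the held-out test set is only used once at the end (the candidate k values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Try several values of k, scoring each with 5-fold cross-validation on the training data
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7, 9, 11]}, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)            # the chosen k
print(search.score(X_test, y_test))   # final evaluation on untouched test data
```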
What is the nearest centroid?
simple method: take the centroid of every class (ex. the red and blue crosses in the figure), reducing each class to a single point, and then for new data points just look at which centroid is closer
What is the nearest shrunken centroid?
Nearest centroid classification:
Takes a new sample, and compares it to each of these class centroids. The class whose centroid it is closest to, in squared distance, is the predicted class for that new sample.
Nearest shrunken centroid classification:
"shrinks" each of the class centroids toward the overall centroid for all classes by an amount we call the threshold. This shrinkage consists of moving the centroid towards zero by threshold, setting it equal to zero if it hits zero. For example if threshold was 2.0, a centroid of 3.2 would be shrunk to 1.2, a centroid of -3.4 would be shrunk to -1.4, and a centroid of 1.2 would be shrunk to zero.
After shrinking the centroids, the new sample is classified by the usual nearest centroid rule, but using the shrunken class centroids
-increasing threshold = moving closer to the center
-useful when you have a huge number of features
-helps to eliminate unimportant features
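A minimal sketch of both classifiers in scikit-learn (the dataset and the shrink_threshold value are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import NearestCentroid

X, y = load_breast_cancer(return_X_y=True)

plain = NearestCentroid().fit(X, y)                         # nearest centroid
shrunken = NearestCentroid(shrink_threshold=0.5).fit(X, y)  # shrinks centroids toward the overall centroid

print(plain.score(X, y), shrunken.score(X, y))
```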
How is nearest centroid problematic? kNN vs nearest centroid
nearest centroid would misclassify data points when a class is not well summarized by a single centroid (ex. elongated or overlapping clusters), whereas kNN can still follow the local structure
Describe classification vs regression
Classification: The model trained from the data defines a decision boundary that separates the data
Regression: The model fits the data to describe the relation between 2 features or between a feature (ex. height) and the label (ex. yes/no)
What is k-NN regression?
k-NN classification combines the discrete predictions of k-neighbours
k-NN regression combines continuous predictions
k-NN regression predicts the (weighted) average of the target values of the neighbors
in the diagram there are 2 versions of the same regression task using weights: the top uses uniform weights, the bottom uses distance weights
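A minimal sketch of k-NN regression with uniform vs. distance weights (the data is made up):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([0.1, 0.9, 2.1, 2.9, 4.2])

uniform = KNeighborsRegressor(n_neighbors=3, weights="uniform").fit(X, y)
weighted = KNeighborsRegressor(n_neighbors=3, weights="distance").fit(X, y)

print(uniform.predict([[1.5]]), weighted.predict([[1.5]]))  # (weighted) averages of the 3 nearest targets
```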
What are 3 k-NN advantages?
The cost of the learning process is zero
No assumptions about the characteristics of the concepts to learn have to be made
Complex concepts can be learned by local approximation using simple procedures
Describe kNN for missing values imputation
missing values are replaced with the (weighted) average of that feature’s values among the k nearest neighbors
What are 3 k-NN disadvantages?
The model can not be interpreted (there is no description of the learned concepts)
It is computationally expensive to find the k nearest neighbours when the dataset is very large
Performance depends on the number of dimensions that we have (curse of dimensionality)
What is the curse of dimensionality and overfitting?
Slide 1:
More information is needed for classification, therefore we add a second feature
Feature 2: average amount of green color in image
Slide 2:
Even more information is needed for classification, therefore we add a third feature
Feature 3: average amount of blue color in image
Slide 3:
In three dimensions (= three features), perfect separation of CATS and DOGS is possible with a decision boundary (plane)
In the dog-cat example, Does adding features improve classification?
This example suggests that by adding (informative) features, classification is improved. This is often the case, but... Adding new features increases the volume of feature space exponentially
For instance: 1 feature has 10 different values
1 feature: 10 possible feature values
2 features: 100 possible feature values
3 features: 1000 possible feature values
as more features are added, data becomes sparse, distances lose meaning, and algorithms that depend on “closeness” struggle.
Example: k-NN
Solution: models less sensitive to raw distances (ex. trees, boosted methods)
Or dimensionality reduction, feature selection
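A minimal sketch of dimensionality reduction with PCA as one way to fight the curse of dimensionality (the dataset and number of components are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

X_scaled = StandardScaler().fit_transform(X)    # PCA is sensitive to feature scales
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X_scaled)

print(X.shape, "->", X_reduced.shape)
print(pca.explained_variance_ratio_.round(3))   # variance captured by each component
```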