Machine Learning

83 Terms

1

What is machine learning?

Teaching a computer to do a task using “data” (ex. examples of what we want it to do)

  • ex. trying to distinguish between a greyhound and a labrador

    • Create Dataset: Collect some examples of Greyhounds and Labradors

    • Collect photos of dogs in each breed

    • Describe each dog to the computer

    • Height of specific dog *

    • Weight

    • Color

*machine learning doesn’t predict randomness but instead learns from patterns in the data + learning from experience

  • exposure to more data increases learning quality → optimization issue (trying to minimize error)

2

How do you increase the prediction quality in machine learning - distinguishing between a labrador vs greyhound?

Adding more features increases the prediction quality

  • too many features = exhaust resources

  • too few features = more errors

feature value = ex. 68 years, male

target value = ex. has diabetes

feature = ex. age

Decision boundary - cut off threshold

3

Why did machine learning become more relevant?

The evolution of digital data → huge expansion

  • used to have structured data → highly organized, excel spreadsheets..

  • now we have unstructured data → emails, text and video files, pictures, satellite data…

Volume of data collected grows daily

  • More than 90% of the world’s data was created within the past two years!!!

  • Data is cheap and abundant but ... knowledge is expensive and scarce

  • To make sense of all the unstructured data: we need knowledge discovery

  • Machine learning: computers learn from data to aid knowledge discovery

4

What are the 3 main branches of machine learning?

  • supervised learning: have data and labels, model predicts a label for new data

  • unsupervised learning: no labels are given (ex. a pile of cat pics with no labels), the model finds structure without any prior labeling

  • reinforcement learning: reward system, maximize reward, system learns by doing

5

Describe Supervised learning

Given a set of data and labels, a model learns to predict a label for new data

• Given D = {Xi,Yi} learn a model (or function) F: Xk → Yk

  • Often used to automate manual labor

    • ex. you might annotate part of a dataset manually, then learn a machine learning model from these annotations, and use the model to annotate the rest of your data

examples:

  • Given a satellite image, what is the terrain in the image?

    • Xi = pixels (image regions), Yi = terrain type

  • Given some test results from a patient, will the patient have diabetes?

    • Xi = test results, Yi = diabetes/no diabetes
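
*a minimal sketch of the D = {Xi, Yi} idea, assuming scikit-learn is available; the feature values below are made up:

```python
from sklearn.neighbors import KNeighborsClassifier

# Xi = features (height in cm, weight in kg), Yi = breed label
X = [[71, 29], [74, 32], [56, 30], [58, 33]]   # hypothetical dogs
y = ["greyhound", "greyhound", "labrador", "labrador"]

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X, y)                    # learn F: X -> Y from the labeled examples
print(model.predict([[72, 30]]))   # predict the label of a new, unseen dog
```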

6

Describe Unsupervised learning

Discover patterns in data, no labels

  • Given D = {Xi} group the data into Y classes using a model (or function)

  • F: Xi → Yj

examples:

  • Discovering trending topics on Twitter or in the news

  • Grouping data into clusters for easier analysis

  • Outlier detection (ex. Fraud detection and security systems)

7

Describe Reinforcement learning

Reasoning under uncertainty to make optimal decisions, reward mechanism, optimization of minimizing error

  • how agents should take actions in an environment to maximize some reward.

  • Given D = {environment (e), actions (a), rewards (r)} learn a policy and utility functions:

    • policy: F1: {e,r} → a

    • utility: F2: {a,e} → r

8

Describe Machine Learning workflow

get the data, choose an algorithm, clean the data (decide what to include and exclude), split the labeled data into a training set and a test set, learn the model on the training set, then test the learned relationship on unseen data

  • Pre-processing data is related to which model you use
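
*a minimal sketch of the split-then-evaluate part of the workflow, assuming scikit-learn and one of its toy datasets:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# hold out part of the labeled data as an unseen test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = KNeighborsClassifier().fit(X_train, y_train)   # learn the model on the training set
print(model.score(X_test, y_test))                     # test it on unseen data
```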

9

What is a model and score functions/score metrics?

  • A model: an equation that links the values of some features to the predicted value of the target variable; finding the equation (and coefficients in it) is called ‘building a model’ (see also ‘fitting a model’).

  • Score functions/Fit statistics/Score metrics: measures of how well the model fits the data.

10

What is feature selection and extraction?

  • Feature selection: reducing the number of predictors by selecting the important ones (dimensionality reduction).

  • Feature extraction: transform the data onto a new feature space by means of a mathematical operation (ex. PCA).

    • decreases number of features

11

What are the 2 main types of supervised learning?

  • classification: discrete output (labels)

    • ex. color, yes/no

    • examine the stats of two football teams and predict which team will win tomorrow’s match, a discrete task

  • regression: continuous output (numbers)

    • ex. temp, age, distance

    • predict a numerical value: predict the number of Microsoft shares that will be traded tomorrow

binary classification: used to predict only 2 discrete-valued outputs such as 0 and 1

12

What are the DummyClassifier and DummyRegressor?

Do not generate any insight about the data

  • Serve as a simple baseline to compare against other, more complex classifiers/regressors

  • see how much better your model is compared to randomness

DummyClassifier

  • classifies the given data using only simple strategies; most-frequent, uniform, constant...

  • if such a simple strategy gets closer to the actual labels than your model does, the dummy baseline beats you

DummyRegressor

  • makes predictions using simple strategies; mean, median ...
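
*a minimal sketch of using a dummy baseline, assuming scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# always predicts the most frequent class; learns nothing about the features
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print(baseline.score(X_test, y_test))   # a real model should beat this score
```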

13

Describe Images used for machine learning data

Computers work with numbers

  • Images are arrays of numbers (RGB values for each pixel)

  • every pixel has a depth (ex. a grayscale intensity); the computer understands these numerical values

  • number of pixels gives you number of features in dataset

14

Describe Text used for machine learning data

Words/ Letters need to be converted in a format computers can understand

  • have unstructured sentences, split by spaces (tokenize)…

  • storing all of this explicitly as 0s and 1s would use too much memory, so only the locations of the non-zero entries are stored instead (a sparse, smaller representation)

15

What happens with scaling data? - preprocessing

With few exceptions, Machine Learning algorithms don’t perform well when the input numerical attributes have very different scales

  • scales of features are different from each other so the feature with largest scale could have too much power to determine outcome → prevent this by scaling the data down

16

What is the standard scaler?

z-scores (standard scores): mean of 0 and standard deviation of 1

  • common method in data normalization (good for non-skewed data)

*outliers have more weight when determining the mean

17

What is the robust scaler?

Same as Standard Scaler, but with median instead of mean and interquartile range instead of standard deviation.

  • Better for skewed data

  • Deals better with outliers

*tighter cluster around (0,0), with axes more stable, outliers less extreme

18

What is the MinMax scaler?

Shifts data to an interval set by xmin and xmax

*between 0 and 1 interval

19

What is the normalizer?

  • Does not work by feature (column) but by row

  • Each row of data is rescaled so that its norm (magnitude) becomes 1, but the relative proportions of features within that row are preserved.

    • Compute the norm of the vector (square root of the sum of the squared elements)

    • Divide each element by the norm

  • Used only when the direction of data matters

    • Helpful for histograms - Histograms often compare feature distributions across samples, Without normalization, histograms could be dominated by the largest- magnitude samples

*scales things per row, makes the length of every vector 1, good for histograms

20

What are the 4 types of scaled data?

  • standard → more scattered

  • robust → tighter

  • minmax → between 0 and 1

  • normalizer → height of 1
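
*a minimal sketch comparing the four scalers on the same toy data, assuming scikit-learn:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, RobustScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 900.0]])  # features on very different scales

print(StandardScaler().fit_transform(X))  # per column: mean 0, standard deviation 1
print(RobustScaler().fit_transform(X))    # per column: median 0, scaled by the IQR
print(MinMaxScaler().fit_transform(X))    # per column: squeezed into the [0, 1] interval
print(Normalizer().fit_transform(X))      # per row: each vector rescaled to norm 1
```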

21

What are Univariate Transformations?

  • Examples of univariate transformations: logarithmic, geometric, power ...

  • Most ML models perform best with Gaussian distributed data (bell curve)

  • Methods to transform data to Gaussian include Box-Cox transform and Yeo-Johnson transform

    • Parameters can be automatically estimated so that skewness is minimized and variance is stabilized

22

How do you transform to the log scale? - Univariate Transformations
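
*a minimal sketch of a log transform, assuming numpy (log1p is used so zero values stay valid):

```python
import numpy as np

x = np.array([1.0, 3.0, 10.0, 100.0, 1000.0])   # right-skewed feature values

x_log = np.log1p(x)   # log(1 + x): compresses large values, keeps 0 valid
print(x_log)
```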

23

What is Binning?

  • Separate feature values into n categories (ex. equally spaced over the range of values)

  • Replace all values within a category by a single value, ex. mean

  • Effective for models with few parameters, such as regression, but not effective for models with many parameters, such as decision trees

*way of preprocessing the data, but not common practice

24

What does it mean to measure classification success?

  • How “predictive” are the models we have learnt?

  • New data is probably not exactly the same as the training data

    • What happens if we overfit our data?

*accuracy is a way to measure how well your model generalizes to unseen data; as the model gets more complex, training accuracy moves closer to 1

green = test set

blue = training set

  • overfitting: model is 100% tailored to training set

  • underfitting: model is too simple to capture the patterns even in the training set

25

How do you avoid over-fitting?

Build a classifier using the training set and evaluate it using the test set

  • use more data points for training set, make it more representative of the data universe

26

Subject-wise splitting

you want to include all the info on patients in either training or test set

  • leakage = data from the same patient ends up in both the training and the test set

27

What is cross validation?

  • To evaluate (test) your model’s ability to predict new data

  • Detect overfitting or selection bias

28

What is a cross validation technique?

K-fold cross-validation

  • leave one out (K-fold cross-validation to the extreme)

*split the data into k folds; each fold is used once as the test set while the rest are used for training, so all data points are used in the model; with 4 folds you calculate accuracy 4 times…

*important for small datasets
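
*a minimal sketch of k-fold cross-validation, assuming scikit-learn (here with 4 folds):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# each of the 4 folds is used once as the test set, so every point is used; accuracy is computed 4 times
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=4)
print(scores, scores.mean())
```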

29

What are Machine learning pipelines?

Workflows to execute a sequence of tasks

  • Data normalization (scaling)

  • Imputation of missing values

  • Dimensionality Reduction

  • Classification
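
*a minimal sketch of such a pipeline (scaling followed by classification), assuming scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),       # data normalization step
    ("clf", KNeighborsClassifier()),   # classification step
])
pipe.fit(X_train, y_train)             # the scaler is fit on the training data only
print(pipe.score(X_test, y_test))
```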

30

What is clustering?

Making a new classification system for Pokemon characters according to similarities in their characteristics such as strength, speed, defensiveness → features, unsupervised learning, no labels

31

Lecture 2 - Feature Engineering

32

What is Missing Values Imputation?

  • In real world datasets, missing input values are very common

  • No standard encoding (blank, 0, “NA”, NaN, Null, ...)

  • Imputation: replacing missing value with estimate for that value

    • Mean / median

    • KNN

    • Model-driven

    • Iterative
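
*a minimal sketch of mean imputation, assuming scikit-learn (a k-NN variant is sketched with the k-NN cards below):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])   # NaN marks the missing values

imputer = SimpleImputer(strategy="mean")   # replace each NaN with its column mean
print(imputer.fit_transform(X))
```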

33

Describe the 3 approaches for missing values imputation

Mean: Taking the mean gives you an unrealistic data set

  • in image, rounded dots are original feature values, colors indicate labels

  • you can take another approach → KNN

KNN:

  • more realistic because all the squares are in the neighborhood of the actual data points, calculated based on the points surrounding the data point

  • k = number of neighbors you’re taking into account

*can also drop the missing value

34

What are categorical variables?

  • Data often has categorical (or discrete) features → non-numerical, no order (gender, city…)

  • Remember measurement levels: categorical – ordinal – interval – ratio.

  • Often necessary to represent categorical features as numbers.

    • One Hot encoding

    • Count-based encoding

35

What is One Hot Encoding?

  • gives every category a numerical value → ex. 3 colors red,green,blue get 3 vectors of 1,0,0 + 0,1,0 + 0,0,1

*values don’t carry importance; it’s more about labeling and organizing

  • important that the math used by machine learning models is not affected by the encoding → impossible to use 1,2,3, ...

  • Adding one feature for each category (feature encodes whether a sample belongs to this category or not)

→ all colours are equally distant from each other
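
*a minimal sketch of one-hot encoding three colors, assuming pandas:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# one new 0/1 feature per category; all colors stay equally distant from each other
print(pd.get_dummies(df, columns=["color"]))
```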

36

What is Count-Based Encoding?

  • For high cardinality categorical features → ex. countries

  • Instead of 50 one-hot variables, replace label with the value of a variable aggregated over that label.

  • For regression: “people in this state have an average response of y”

  • Binary classification: “people in this state have likelihood p for class 1”

  • Multiclass: One feature per class: probability distribution

*choose the average value of the concept you’re interested in, ex. for every city put the average temp

37

Describe how digital images are processed

  • The values are all discrete and integers.

  • Can be considered as a large array of discrete dots

  • Each dot has a brightness associated with it.

  • These dots are called picture elements - pixels

images consist of pixels, each pixel has 3 colors, each color has an intensity

  • every pixel is a dot, every dot has either red, green, blue

  • every color has intensity → has 3 values

38

Describe Arrays and Images

  • Images are represented as matrices (e.g. numpy arrays)

  • Can be written as a function f(x,y)

  • Types of images: Binary Images, Grayscale Images and Color Images

39

Describe Greyscale Images

  • Each pixel is a shade of gray

  • Normally from 0 (black) to 255 (white). Each pixel can be represented by eight bits, or exactly one byte.

  • Other grayscale ranges are used, but generally are a power of 2 (ex. 2² = 4, 2⁶ = 64)

*only have one intensity value

40

Describe Multi-channel Images

  • Such an image is a stack of multiple matrices, representing the multiple channel values for each pixel

  • Ex. RGB color is described by the amount of red, green and blue in it

*3 ranges, 3 values for every pixel, intensities for each color

41

What are measures for segmentation for machine learning with images?

  • A segmentation result can be measured if the ground truth is known

  • Empirical Measures:

    • Accuracy, Precision and Recall

    • F-score, Jaccard Index

42

What is accuracy and what are problems with it?

Accuracy: sum of diagonals over everything → (TP+TN) / (TP+TN+FP+FN)

Problems with accuracy:

  • Imbalanced classes lead to hard-to-interpret accuracy

positive = there, negative = not there

*want to minimize false positives in SPAM email case for example

*false negative = you miss smth, detrimental in healthcare diagnosis instances

43

Precision, Recall, F-score

precision is inversely related to false positives → use it to minimize FP

recall is inversely related to false negatives → use it to minimize FN

*use when data set isn’t balanced
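
*a minimal sketch of these metrics, assuming scikit-learn; the labels below are made up:

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]   # 1 = positive class (ex. spam)
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]   # one false negative, one false positive

print(confusion_matrix(y_true, y_pred))
print(precision_score(y_true, y_pred))   # TP / (TP + FP): high when there are few false positives
print(recall_score(y_true, y_pred))      # TP / (TP + FN): high when there are few false negatives
print(f1_score(y_true, y_pred))          # F1 = harmonic mean of precision and recall
```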

44

Describe how machine learning works with text data

  • Most Machine Learning algorithms prefer to work with numbers

  • So far:

    • Fixed number of features

    • Continuous

    • Categorical

  • Working with Text Data

    • no pre-defined features

    • Need to create fixed-length descriptions

*super unstructured and abundant

45

Features from text: Bag of words

  • split into parts: tokenizer

  • build a vocabulary with all the words

  • create a dictionary

  • represent into numerical form

1 = word is there

0 = word is not there

length of array = how many words you have in the document

*more 1’s means more words present
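
*a minimal sketch of the tokenize → vocabulary → counts steps, assuming scikit-learn:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the dog barks", "the cat and the dog"]

vec = CountVectorizer()              # tokenizes and builds the vocabulary
X = vec.fit_transform(docs)          # sparse matrix: only non-zero positions are stored
print(vec.get_feature_names_out())   # the vocabulary (one feature per word)
print(X.toarray())                   # per document: how often each word appears
```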

46

Describe Text Data Preprocessing

  • Tokenization — convert sentences to words

  • Removing unnecessary punctuation, tags

  • Removing stop words— frequent words such as ”the”, ”is”, etc. that have low semantic content

  • Stemming — words are reduced to a root by removing inflection through dropping unnecessary characters, usually a suffix.

  • Lemmatization —Another approach to remove inflection by determining the part of speech and utilizing detailed database of the language

47

What is Tokenization?

  • Process of breaking a stream of textual content up into words, terms, symbols, or some other meaningful elements called tokens

  • The list of tokens becomes the input for additional processing such as parsing or text mining

  • Tokenization can swap out sensitive data

    • Ex. Typically payment card or bank account numbers—with a randomized number in the same format

48

What is Stemming and Lemmatization?

Stemming — words are reduced to a root by removing inflection through dropping unnecessary characters, usually a suffix.

  • The stemmed form of studies is: studi

  • The stemmed form of studying is: study

Lemmatization —Another approach to remove inflection by determining the part of speech and utilizing detailed database of the language

  • The lemmatized form of studies is: study

  • The lemmatized form of studying is: study

49

How do you Restrict the Vocabulary?

  • Removing unnecessary punctuation, tags

  • Removing stop words— frequent words such as ”the”, ”is”, etc. that have low semantic content

  • Removing infrequent words

    • Words that appear only once or twice might not be helpful

    • Restrict vocabulary size to only the most frequent words (for fewer features)

50

What is Bag of Words?

  • Most common technique to numerically represent text is Bag of Words.

  • Represents each sentence or document as a vector with a value for each word in the vocabulary.

    • Binary: word present or absent in the document

    • Count: how often the word appears in the document

    • Popular approach: Term Frequency x Inverse Document Frequency (TF-IDF)

51

What is Term Frequency-Inverse Document Frequency (TF-IDF)?

  • Term Frequency (TF) = Number of times term t appears in a document/Number of terms in the document

  • Inverse Document Frequency (IDF) = log(N/n), where, N is the number of documents and n is the number of documents a term t has appeared in. The IDF of a rare word is high, whereas the IDF of a frequent word is likely to be low. Thus having the effect of highlighting words that are distinct.

  • We calculate TF-IDF value of a term as = TF * IDF
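
*a minimal sketch of TF-IDF weighting, assuming scikit-learn (its exact formula adds smoothing to log(N/n), but the idea is the same):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the dog barks", "the cat meows", "the dog and the cat"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())
print(X.toarray().round(2))   # "the" appears in every document -> low weight; rarer words score higher
```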

52

Working with Text Data

53

Describe Feature Selection

Why select features?

  • Avoid overfitting

  • Faster prediction and training

  • Less storage for model and dataset

Strategies

  • Univariate statistics

  • Model-based selection

  • Iterative selection

54

What are Univariate Statistics?

  • Look at each feature individually

  • Features will be removed if they do not have a significant relationship with the target

  • Features that are significant only in combination with another feature (interaction) will be removed.

  • Selecting features with highest confidence is related to ANOVA (from statistics)

Pick statistic, check p-values!

f_regression, f_classif, chi2 in scikit-learn
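
*a minimal sketch of univariate feature selection with one of the statistics above, assuming scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# score each feature individually against the target and keep the best 2
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)
print(selector.scores_)   # per-feature ANOVA F-scores
print(X_new.shape)        # (150, 2): only two features are kept
```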

55

What is Model-Based Feature Selection?

  • Get best fit for a particular model

  • Ideally: exhaustive search over all possible combinations

  • Exhaustive is infeasible (and has multiple testing issues)

  • Use heuristics in practice

56

What is Model based (single fit)?

  • Build a model, select features most important to model.

  • Lasso, other linear models, tree-based models

  • Multivariate – linear models assume linear relation

57

What is iterative Model-Based Selection?

  • Forwards: Start with single feature, find most important feature, add, iterate

  • Backwards: Fit model, find least important feature, remove, iterate

  • Computationally expensive

58

RFE : Recursive feature elimination and selection
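
*a minimal sketch of recursive feature elimination, assuming scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# repeatedly fit the model and drop the least important feature
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # 1 = selected; larger numbers were eliminated earlier
```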

59

LECTURE 3 - KNN

60

What is k-NN? (k-nearest neighbors)

  • way of labeling data points in the absence of labels in a dataset

  • k-NN can be used for both regression and classification problems → need to label data points and predict the label, you look at the nearest neighbors to determine the target value

  • algorithm depends on distance

  • k = hyperparameter (choose the value)

  • if k decreases, the decision boundary follows the closest neighbors more tightly → might imply overfitting

    • k of 5 is a larger population to make a prediction

  • downside: prediction requires a lot of calculations; upside: it doesn’t require a training process

61

What are the unknown data points? (1-NN)

62

What is the k-NN classifier?

  • The hyperparameter k represents the number of labeled neighbours to consider

  • Test points are assigned the majority label of the k nearest neighbours

  • Special cases:

    • k = N: since all datapoints are considered, the predicted label for a test point will always be the majority label of all datapoints. Equivalent to a majority classifier.

    • Ties: in case of a tie between predicted labels, there are different possibilities. The most common one is random selection from the tied labels

*if k is even, it’s problematic, ambiguity in the process
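
*a minimal sketch of the k-NN classifier with k as the hyperparameter, assuming scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# try a few (odd) values of the hyperparameter k
for k in [1, 3, 5, 15]:
    clf = KNeighborsClassifier(n_neighbors=k)           # majority vote of the k nearest neighbours
    print(k, cross_val_score(clf, X, y, cv=5).mean())
```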

63

What is the nearest-neighbor? (3-NN)

  • taking more neighbors into account can change the predicted value

64

What is K-Nearest Neighbours Hypothesis Space?

  • two-dimensional feature space split into regions whose centroids are the data points

  • concept seen in nature → skin of giraffes

65

What is the influence of k on the decision boundary?

  • smaller k, you take into account the neighbors closest

  • increasing k means you have a smoother decision boundary

*large k = less complex model because you take a lot of data points into the decision, which makes the decision boundary smoother

66

What is the label (class) of a point on the decision boundary?

it’s ambiguous

  • you can’t make a decision on the problem with one feature

  • cant find the closest neighbor

67

Describe weights in k-NN

  • weights help to make machine learning algorithms smoother

  • having 2 close neighbors + 1 far one creates a risk → fix by giving closer neighbors more weight

  • can make weights uniform (all have equal importance), use distance, make weights different → flexibility of k-NN

  • Extension of the basic algorithm: not all neighbors get an equal vote

  • Distance-weighting: each neighbor has a weight which is based on its distance to the data point to be classified

    • Inverse distance weighting – each point has a weight equal to the inverse of its distance to the point to be classified (neighboring points have a higher vote)

    • Inverse of the square of the distance

    • Kernel functions (Gaussian kernel, tricube kernel)

  • If we change the distance function, the results will change.

  • Implication: with distance weighting, k=n is no longer equivalent to a majority based classifier.

68

How do you compute distance in k-NN?

Different ways to define the distance function

  • Euclidean distance (straight line)

  • Manhattan distance (distance between projections on the axis)

  • Difference between Euclidean and Manhattan distance
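
*a minimal sketch of the two distances for the same pair of points, assuming numpy:

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # straight-line distance: 5.0
manhattan = np.sum(np.abs(a - b))           # sum of per-axis differences: 7.0
print(euclidean, manhattan)
```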

69

How does k determine model complexity?

  • The model in k-NN is the decision boundary that separates the classes (In regression, the model is the line that fits the data)

  • Smaller k leads to more complex decision boundaries

  • k too low → danger of overfitting, high complexity

  • k too high → danger of underfitting, low complexity

*doing well on unseen data points = generalization

70

What is the bias variance trade off?

Variance: how sensitive the model is to small changes in the training data.

  • A high-variance model will give very different decision boundaries if you slightly change the dataset.

In kNN:

  • Small k (e.g. k=1): the decision boundary can shift a lot if just one training point changes → high variance.

  • Large k: more stable, since predictions average over many neighbors → low variance.

Bias: how far the model’s average prediction is from the true underlying relationship.

  • A high-bias model makes strong assumptions and might miss important patterns.

In kNN:

  • Small k: very flexible → low bias, because it can fit even complicated shapes.

  • Large k: very smooth boundaries → high bias, because it might oversimplify (e.g. blur two nearby classes together).

71

How do you determine model complexity?

  • Depends on complexity of the separation between the classes

  • Start with the simplest model (large k in k-NN), and increase complexity (smaller k)

-trying to see generalization

-overfitting decreases with a larger k value

72

How do you choose k?

  • Typically odd for an even number of classes (ex. 1, 3, 5, 7..)

  • As you decrease k, accuracy might increase, but so does complexity

  • In other words, a small value of k is likely to lead to overfitting (fitting “noise”)

  • A rule of thumb used by some data-miners: 𝑘 = sqrt(𝑛)

73

What 3 sets is your data divided into to tune hyperparameters?

  • training

  • validation

  • test

74

What is the nearest centroid?

simple method: take the centroid of every class (ex. the blue and the red crosses), i.e. one representative point per class, and then for new data points you just look at which centroid (blue or red) is closer

75

What is the nearest shrunken centroid?

Nearest centroid classification:

Takes a new sample, and compares it to each of these class centroids. The class whose centroid it is closest to, in squared distance, is the predicted class for that new sample.

Nearest shrunken centroid classification:

  • "shrinks" each of the class centroids toward the overall centroid for all classes by an amount we call the threshold. This shrinkage consists of moving the centroid towards zero by threshold, setting it equal to zero if it hits zero. For example if threshold was 2.0, a centroid of 3.2 would be shrunk to 1.2, a centroid of -3.4 would be shrunk to -1.4, and a centroid of 1.2 would be shrunk to zero.

  • After shrinking the centroids, the new sample is classified by the usual nearest centroid rule, but using the shrunken class centroids

-increasing threshold = moving closer to the center

-useful when you have a huge number of features

-helps to eliminate the features, unimportant ones

76

How is nearest centroid problematic? kNN vs nearest centroid

  • nearest centroid would misclassify data points

77

Describe classification vs regression

Classification: The model trained from the data defines a decision boundary that separates the data

Regression: The model fits the data to describe the relation between 2 features or between a feature (ex. height) and the label (ex. yes/no)

78

What is k-NN regression?

  • k-NN classification combines the discrete predictions of k-neighbours

  • k-NN regression combines continuous predictions

  • k-NN regression fits the best line between the neighbors

in the diagram you use weights, 2 versions of the same regression task, top uniform, bottom distance

79

What are 3 k-NN advantages?

  • The cost of the learning process is zero

  • No assumptions about the characteristics of the concepts to learn have to be done

  • Complex concepts can be learned by local approximation using simple procedures

80

Describe kNN for missing values imputation
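
*a minimal sketch of k-NN imputation, assuming scikit-learn:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0], [6.8, 7.9], [7.0, 8.0], [np.nan, 8.2]])   # NaN marks the missing value

# the missing value is estimated from the 2 rows whose known features are most similar
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```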

81

What are 3 k-NN disadvantages?

  • The model can not be interpreted (there is no description of the learned concepts)

  • It is computationally expensive to find the k nearest neighbours when the dataset is very large

  • Performance depends on the number of dimensions that we have (curse of dimensionality)

82

What is the curse of dimensionality and overfitting?

Slide 1:

  • More information is needed for classification, therefore we add a second feature

  • Feature 2: average amount of green color in image

Slide 2:

  • Even more information is needed for classification, therefore we add a third feature

  • Feature 3: average amount of blue color in image

Slide 3:

  • In three dimensions (= three features), perfect separation of CATS and DOGS is possible with a decision boundary (plane)

83

In the dog-cat example, Does adding features improve classification?

This example suggests that by adding (informative) features, classification is improved. This is often the case, but... Adding new features increases the volume of feature space exponentially

  • For instance: 1 feature has 10 different values

    • 1 feature: 10 possible feature values

    • 2 features: 100 possible feature values

    • 3 features: 1000 possible feature values

as more features are added, data becomes sparse, distances lose meaning, and algorithms that depend on “closeness” struggle.

  • Example: k-NN

  • Solution: models less sensitive to raw distances (ex. trees, boosted methods)

  • Or dimensionality reduction, feature selection
