4 TOPICS IN ONE
Numerical Data
Integers or floating-point values that behave like numbers
They are additive, countable, ordered, and so on
Examples of Numerical Data
Examples are:
Temperature
Weight
The number of deer wintering in a nature preserve
Feature Vector
Floating-point values comprising one example
Feature Engineering
Determining the best way to represent raw dataset values as trainable values in the feature vector
A vital part of machine learning
The most common techniques are normalization and binning
Normalization
Converting numerical values into a standard range
Converts floating-point numbers to a constrained range that improves model training
Binning (bucketing)
Converting numerical values into buckets of subranges
Also called bucketing, this feature engineering technique groups different numerical subranges into bins or buckets
It turns numerical data into categorical data
First Steps in creating feature vectors
Visualize your data in plots or graphs
Get Statistics about your data
Outlier
A value distant from most other values in a feature or label
Outliers often cause problems in model training, so finding them is important
Finding Outliers
When the delta between the 0th and 25th percentiles differs significantly from the delta between the 75th and 100th percentiles
Categories of Outliers
Categories:
Due to a mistake
A legitimate data point
Importing the dataset in pandas
training_df = pd.read_csv("training_data.csv")  # the file name is a placeholder
Getting Basic Statistics in Pandas
training_df.describe()
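A minimal, self-contained sketch of both steps (the file name is a placeholder):

import pandas as pd

# Load the raw dataset into a DataFrame.
training_df = pd.read_csv("training_data.csv")

# Print count, mean, std, min, max, and the 25th/50th/75th percentiles
# for every numerical column.
print(training_df.describe())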
Goal of Normalization
To transform features to be on a similar scale
Normalization Benefits
Benefits:
Helps models converge more quickly during training
Helps models infer better predictions
Helps avoid the NaN trap when feature values are very high
Helps the model learn appropriate weights for each feature
NaN
Not a number
NaN Trap
When a value in the model exceeds the floating-point precision limit, the system sets the value to NaN instead of a number
When one number in the model becomes NaN, other numbers in the model also eventually become NaN
Three Popular Normalization Methods
3 Methods:
Linear Scaling
Z-score scaling
Log Scaling
Linear Scaling
Converting floating-point values from their natural range into a standard range, usually 0 to 1 or -1 to +1
When the feature is uniformly distributed across a fixed range
Linear Scaling Conditions
The lower and upper bounds of your data don’t change much over time
The feature contains few or no outliers, and those outliers aren’t extreme
The feature is approximately uniformly distributed across its range. That is, a histogram would show roughly even bars for most values
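A minimal sketch of linear scaling to the range 0 to 1, assuming the feature is a pandas Series:

import pandas as pd

def linear_scale(feature: pd.Series) -> pd.Series:
    # Map the feature's natural range [min, max] onto the standard range [0, 1].
    return (feature - feature.min()) / (feature.max() - feature.min())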
Z-Score
The number of standard deviations a value is from the mean
Z-Score Scaling
Storing that feature’s z-score in the feature vector
When the feature distribution does not contain extreme outliers
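A minimal sketch of z-score scaling, again assuming the feature is a pandas Series:

import pandas as pd

def z_score_scale(feature: pd.Series) -> pd.Series:
    # z = (value - mean) / standard deviation
    return (feature - feature.mean()) / feature.std()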
Log Scaling
Computes the logarithm of the raw value. In theory, the logarithm could be any base; in practice, log scaling usually calculates the natural logarithm (ln)
When the feature conforms to the power law
Power Law Distribution
Low values of X have very high values of Y
As values of X increase, the values of Y quickly decrease. Consequently, high values of X have very low values of Y
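A minimal sketch of log scaling, assuming all raw values are positive:

import numpy as np
import pandas as pd

def log_scale(feature: pd.Series) -> pd.Series:
    # Natural logarithm (ln) of each raw value.
    return np.log(feature)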
Clipping
A normalization technique to minimize the influence of extreme outliers
Usually caps (reduces) the value of outliers to a specific maximum value
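A minimal clipping sketch; the percentile cutoffs here are illustrative assumptions, not fixed rules:

import pandas as pd

def clip_outliers(feature: pd.Series) -> pd.Series:
    # Cap values below the 5th percentile and above the 95th percentile.
    return feature.clip(lower=feature.quantile(0.05), upper=feature.quantile(0.95))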
Binning Conditions
It is a good alternative to scaling or clipping when either of the following conditions is met:
The overall linear relationship between the feature and the label is weak or nonexistent
When the feature values are clustered
Quantile Bucketing
Creates bucketing boundaries such that the number of examples in each bucket is exactly or nearly equal
It mostly hides the outliers
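A minimal quantile bucketing sketch with pandas (the column name and bucket count are placeholders, and training_df is the DataFrame loaded earlier):

import pandas as pd

# Split the feature into 10 buckets holding (nearly) equal numbers of examples;
# labels=False returns each example's bucket index instead of an interval.
training_df["feature_bucket"] = pd.qcut(training_df["feature"], q=10, labels=False)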
Scrubbing
The process of cleaning and fixing data by removing or correcting problematic examples like omitted values, duplicates, out-of-range values, and bad labels
Omitted Values
These occur when data is missing or not recorded, such as when a census taker fails to record a resident's age
Duplicate Examples
When the same data appears multiple times in a dataset, such as when a server uploads the same logs twice
Out-of-range feature values
These are data points that fall outside the expected or valid range for that feature, often due to human error or equipment malfunctions, such as when a human accidentally types an extra digit
Bad Labels
This is when data is incorrectly labeled or categorized, such as when a human evaluator mislabels a picture of an oak tree as a maple
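A minimal scrubbing sketch in pandas (the column name and valid range are illustrative placeholders):

# Remove duplicate examples.
clean_df = training_df.drop_duplicates()

# Remove examples with omitted (missing) values.
clean_df = clean_df.dropna()

# Remove out-of-range feature values, e.g. an implausible age.
clean_df = clean_df[(clean_df["age"] >= 0) & (clean_df["age"] <= 120)]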
Qualities of Good Numerical Features
Clearly named
Checked or Tested before Training
Sensible
Clearly Named
Each feature should have a clear, sensible, and obvious meaning to any human on the project
Checked or Tested before Training
Features should be checked for appropriate values and outliers
Sensible
Avoids magic values (purposeful discontinuities in continuous features)
Uses separate Boolean indicator features instead
Polynomial Transform
A technique where a new, synthetic feature is created by raising an existing numerical feature to a power (e.g., squaring, cubing). For example, creating x₂ = x₁².
Use of Polynomial Transform
They allow linear models to capture non-linear relationships in the data. This is useful when the relationship between features and the target variable isn't linear, or when data points from different classes cannot be separated by a straight line but might be separable by a curve.
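A minimal polynomial transform sketch in pandas (column names are placeholders):

# Synthesize a new feature by squaring an existing numerical feature: x2 = x1².
training_df["x1_squared"] = training_df["x1"] ** 2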
Categorical data
This data has a specific set of possible values such as:
Different Species
Names of streets in a particular city
Whether an email is spam
Colors that house exteriors are painted
Encoding
This means converting categorical or other data to numerical vectors that a model can train on
Dimension
Synonym for the number of elements in a feature vector
Vocabulary
When a categorical feature has a low number of possible categories, you can encode it as _____
Index Numbers
ML models can only manipulate floating-point numbers, therefore, you must convert each string to a unique ____ _____
One-hot Encoding
The next step in building a vocabulary is to convert each index number to its _____
Each category is represented by a vector of N elements, where N is the number of categories
Exactly one of the elements in a one-hot vector has the value 1.0; all the remaining elements have the value of 0.0
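A minimal one-hot encoding sketch with pandas (the category values are placeholders):

import pandas as pd

colors = pd.Series(["red", "blue", "green", "blue"])

# One column per category; exactly one 1.0 per row, 0.0 everywhere else.
one_hot = pd.get_dummies(colors, dtype=float)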
Multi-hot encoding
Multiple values can be 1.0
Sparse Representation
A feature whose values are predominately zero (or empty)
This means storing the position of the 1.0 in a sparse vector
Consumes far less memory than the full one-hot vector
Out-of-vocabulary Category (OOV)
Where you store the outlier categories into a single outlier bucket
The system learns a single weight for that outlier bucket
Embeddings
These substantially reduce the number of dimensions, which benefits models by making them train faster and infer predictions more quickly
Feature Hashing
A less common way to reduce the number of dimensions
Used when the number of categorical feature values is very large
Human Raters
Data rated by humans yields gold labels, which are more desirable than machine-labeled data due to better data quality
Still subject to human error, bias, and malice
Inter-rater agreement
A measure of how often raters' decisions agree; any two humans may label the same example differently
Machine Raters
Categories are automatically determined by one or more classification models
Often referred to as silver labels
High dimensionality
Feature vectors having a large number of elements
This increases training costs and makes training more difficult
Feature crosses
Created by crossing two or more categorical or bucketed features of the dataset
Allow linear models to handle nonlinearities
Polynomial Transforms vs Feature Crosses
Polynomial transforms combine numerical features, while feature crosses combine categorical (or bucketed) features
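A minimal feature-cross sketch in pandas (column names are placeholders); the crossed feature can then be one-hot encoded like any other categorical feature:

# Cross two bucketed features by concatenating their values into one categorical feature.
training_df["lat_x_long"] = (
    training_df["latitude_bucket"].astype(str) + "_" + training_df["longitude_bucket"].astype(str)
)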
Types of Data
Numerical Data
Categorical Data
Human Language
Multimedia
Outputs from other ML systems
Embedding vectors
Reliability of Data
Refers to the degree to which you can trust your data
A model trained on a reliable dataset is more likely to yield useful predictions
Measuring reliability
How common are label errors?
Are your features noisy?
Is data properly filtered for your problem?
Unreliable Data Causes
Omitted Values
Duplicate Examples
Bad feature values
Bad labels
Bad Sections of Data
Imputation
It is the process of generating well-reasoned data for missing values, not random or deceptive data
Good _____ can improve your model; bad _____ can hurt your model
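A minimal imputation sketch in pandas, filling omitted values with the column median (the column name is a placeholder):

median_age = training_df["age"].median()
training_df["age"] = training_df["age"].fillna(median_age)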
Direct Labels
Labels identical to the prediction your model is trying to make
The prediction your model is trying to make appears exactly as a column in your dataset
Better than proxy label
Proxy Labels
Labels that are similar—but not identical—to the prediction your model is trying to make
A compromise—an imperfect approximation of a direct label
Human-generated data
One or more humans examine some information and provide a value, usually for the label
Automatically-generated Data
Software automatically determines the value, usually for the label
Imbalanced Datasets
One label is more common than the other label
The predominant label is called the majority class while the less common is called the minority class
Such datasets may not contain enough minority class examples to train a model properly
Mild Degree of Imbalance
20-40% of the dataset belonging to the minority class
Moderate Degree of Imbalance
1-20% of the dataset belonging to the minority class
Extreme Degree of Imbalance
<1% of the dataset belonging to the minority class
Downsampling
Means training on a disproportionately low subset of the majority class examples
Extract random examples from the dominant class
Upweighting
Means adding an example weight to the downsampled class equal to the factor by which you downsampled
Add weight to the downsampled examples
Rebalancing Ratio
This answers the question: how much should you downsample and upweight to rebalance your dataset?
Determine it experimentally, just as you would experiment with other hyperparameters
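A minimal downsample-and-upweight sketch in pandas, assuming a binary column named "label" where 1 is the minority class and a downsampling factor of 10:

import pandas as pd

majority = training_df[training_df["label"] == 0]
minority = training_df[training_df["label"] == 1]

# Downsample: keep 1 in 10 majority-class examples.
downsampled_majority = majority.sample(frac=0.1, random_state=42)
rebalanced_df = pd.concat([downsampled_majority, minority])

# Upweight: give each downsampled example a weight equal to the downsampling factor.
rebalanced_df["example_weight"] = rebalanced_df["label"].map({0: 10.0, 1: 1.0})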
Testing a model
Doing this on different examples is stronger proof of your model’s fitness than testing on the same set of examples
Approach to Dividing the Dataset
Three Subsets:
Training Set
Validation Set
Test Set
Validation Set
This set performs the initial testing on the model as it is being trained
Used to evaluate results from the training set
Test Set
This set is used to double-check the model
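A minimal three-way split sketch using scikit-learn (the 80/10/10 proportions are an assumption, not a rule from the cards):

from sklearn.model_selection import train_test_split

# First carve off 20% as a holdout, then split the holdout evenly
# into validation and test sets.
train_df, holdout_df = train_test_split(training_df, test_size=0.2, random_state=42)
validation_df, test_df = train_test_split(holdout_df, test_size=0.5, random_state=42)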
Problems with Test Sets
Many of the examples in the test set are duplicates of examples in the training set
Criteria of a Good Test Set
Large Enough to yield statistically significant testing results
Representative of the dataset as a whole
Representative of the real-world data that the model will encounter
Zero examples duplicated in the training set
What to do with Too much data
Sample the data when there is plenty
Generalization
Broadening the knowledge of the model to make it useful to real-world data
The opposite of overfitting
A model that makes good predictions on new data
Overfitting
Means creating a model that matches (memorizes) the training set so closely that the model fails to make correct predictions on new data
Performs well in the lab but is worthless in the real world
Underfit model
Doesn’t even make good predictions on the training data
Is like a product that doesn’t even do well in the lab
Detecting Overfitting
These can help you detect it:
Loss curves
Generalization curves
The loss on the training data decreases over time while the loss on the validation data increases
Overfitting Causes
The training set doesn't adequately represent real-life data (or the validation set or test set)
The model is too complex
Generalization conditions
Examples must be independently and identically distributed; that is, examples can't influence each other
The dataset is stationary, meaning the dataset doesn't change significantly over time
The dataset partitions have the same distribution; that is, the examples in the training set are statistically similar to the examples in the validation set, test set, and real-world data
Occam’s Razor
The preference for simplicity in philosophy
This suggests that simpler models generalize better than complex ones
Regularization
One approach to keeping a model simple
Forcing the model to become simpler during training
L2 regularization
Large weights can have a huge impact
Weights close to zero have little impact
This encourages weights toward 0, but never pushes weights all the way to zero which means it can never remove features from the model
Regularization Rate (lambda)
Model developers tune the overall impact of complexity on model training by multiplying its value by this scalar
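A minimal sketch of the L2 penalty term that gets added to the training loss (lam is the regularization rate):

import numpy as np

def l2_penalty(weights: np.ndarray, lam: float) -> float:
    # lambda * sum of squared weights; large weights are penalized heavily,
    # weights near zero barely at all.
    return lam * float(np.sum(weights ** 2))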
High Regularization Rate:
Strengthens the influence of regularization, thereby reducing the chances of overfitting
Tends to produce a histogram of model weights having the following characteristics:
a normal distribution
a mean weight of 0
Low Regularization Rate
Lowers the influence of regularization, thereby increasing the chances of overfitting
Tends to produce a histogram of model weights with a flat distribution
Removing Regularization
Setting the regularization rate to zero
Training focuses exclusively on minimizing loss, which poses the highest possible overfitting risk
Early stopping
A regularization method that doesn’t involve a calculation of complexity
Ending training before the model fully converges
While it usually increases training loss, it can decrease test loss
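A minimal early-stopping sketch, assuming a Keras model and a validation set (neither appears in the cards themselves):

import tensorflow as tf

# Stop training when validation loss hasn't improved for 3 consecutive epochs
# and restore the best weights seen so far.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True
)
# model.fit(x_train, y_train, validation_data=(x_val, y_val), callbacks=[early_stopping])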