Machine Learning - MIDTERMS

4 TOPICS IN ONE

91 Terms

1

Numerical Data

Integers or floating-point values that behave like numbers

They are additive, countable, ordered, and so on

2

Examples of Numerical Data

Examples are:

Temperature

Weight

The number of deer wintering in a nature preserve

3

Feature Vector

Floating-point values comprising one example

4

Feature Engineering

The process of determining the best way to represent raw dataset values as trainable values in the feature vector

A vital part of machine learning

The most common techniques are Normalization and Binning

5

Normalization

Converting numerical values into a standard range

Converts floating-point numbers to a constrained range that improves model training

6

Binning (bucketing)

Converting numerical values into buckets of ranges

Also called bucketing, this is a feature engineering technique that groups different numerical subranges into bins or buckets

It turns numerical data into categorical data
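
A minimal sketch of binning with pandas, assuming a hypothetical temperature column and hand-picked bucket boundaries:

import pandas as pd

df = pd.DataFrame({"temperature": [3, 12, 25, 31, 18, 7]})  # hypothetical values
# Group numerical subranges into three labeled buckets (categorical data)
df["temp_bucket"] = pd.cut(df["temperature"], bins=[0, 10, 20, 40], labels=["cold", "mild", "hot"])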

7

First Steps in creating feature vectors

Visualize your data in plots or graphs

Get Statistics about your data

8

Outlier

A value distant from most other values in a feature or label

Outliers often cause problems in model training, so finding them is important

9

Finding Outliers

When the delta between the 0th and 25th percentiles differs significantly from the delta between the 75th and 100th percentiles

10

Categories of Outliers

Categories:

Due to a mistake

A legitimate data point

11

Importing the dataset in pandas

training_df = pd.read_csv()

12

Getting Basic Statistics in Pandas

.describe()
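
A minimal runnable sketch combining these two steps; "dataset.csv" is a hypothetical file path:

import pandas as pd

# Load the dataset into a DataFrame
training_df = pd.read_csv("dataset.csv")

# Print count, mean, std, min, quartiles, and max for each numerical column
print(training_df.describe())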

13

Goal of Normalization

To transform features to be on a similar scale

14

Normalization Benefits

Benefits:

Helps models converge more quickly during training

Helps models infer better predictions

Helps avoid the NaN trap when feature values are very high

Helps the model learn appropriate weights for each feature

15

NaN

Not a number

16

NaN Trap

When a value in the model exceeds the floating-point precision limit, the system sets the value to NaN instead of a number.

When one number in the model becomes a NaN, other numbers in the model eventually become NaN too

17

Three Popular Normalization Methods

3 Methods:

Linear Scaling

Z-score scaling

Log Scaling

18

Linear Scaling

Converting floating-point values from their natural range into a standard range, usually 0 to 1 or -1 to +1

When the feature is uniformly distributed across a fixed range
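
A minimal sketch of linear (min-max) scaling to the range 0 to 1, assuming a hypothetical NumPy array of raw feature values:

import numpy as np

x = np.array([10.0, 20.0, 45.0, 60.0, 80.0])  # hypothetical raw values
# x' = (x - x_min) / (x_max - x_min) maps the natural range onto [0, 1]
x_scaled = (x - x.min()) / (x.max() - x.min())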

19

Linear Scaling Conditions

The lower and upper bounds of your data don’t change much over time

The feature contains few or no outliers, and those outliers aren’t extreme

The feature is approximately uniformly distributed across its range. That is, a histogram would show roughly even bars for most values

20

Z-Score

The number of standard deviations a value is from the mean

21

Z-Score Scaling

Storing that feature’s z-score in the feature vector

When the feature distribution does not contain extreme outliers
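
A minimal sketch of z-score scaling, assuming a hypothetical array of raw values:

import numpy as np

x = np.array([10.0, 20.0, 45.0, 60.0, 80.0])  # hypothetical raw values
# z = (value - mean) / standard deviation
z_scores = (x - x.mean()) / x.std()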

22

Log Scaling

Computes the logarithm of the raw value. In theory, the logarithm could be any base; in practice, log scaling usually calculates the natural logarithm (ln)

When the feature conforms to the power law
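
A minimal sketch of log scaling with the natural logarithm, assuming strictly positive hypothetical raw values:

import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0])  # hypothetical power-law-like values
x_log = np.log(x)  # natural logarithm (ln)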

23

Power Law Distribution

Low values of X have very high values of Y

As values of X increase, the values of Y quickly decrease. Consequently, high values of X have very low values of Y

24

Clipping

A normalization technique to minimize the influence of extreme outliers

Usually caps (reduces) the value of outliers to a specific maximum value
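
A minimal sketch of clipping with NumPy; the cap of 60.0 is a hypothetical maximum:

import numpy as np

x = np.array([12.0, 25.0, 48.0, 300.0])  # 300.0 is an extreme outlier
x_clipped = np.clip(x, a_min=None, a_max=60.0)  # caps outliers at 60.0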

25

Binning Conditions

It is a good alternative to scaling or clipping when either of the following conditions is met:

The overall linear relationship between the feature and the label is weak or nonexistent

When the feature values are clustered

26

Quantile Bucketing

Creates bucketing boundaries such that the number of examples in each bucket is exactly or nearly equal

It mostly hides the outliers
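
A minimal sketch of quantile bucketing with pandas, assuming hypothetical values and four buckets:

import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 1000])  # 1000 is an outlier
# Four buckets with nearly equal numbers of examples; the outlier is hidden in the top bucket
buckets = pd.qcut(s, q=4, labels=False)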

27

Scrubbing

The process of cleaning and fixing data by removing or correcting problematic examples like omitted values, duplicates, out-of-range values, and bad labels

28

Omitted Values

These occur when data is missing or not recorded, such as when a census taker fails to record a resident’s age

29

Duplicate Examples

When the same data appears multiple times in a dataset, such as when a server uploads the same logs twice

30

Out-of-range feature values

These are data points that fall outside the expected or valid range for that feature, often due to human error or equipment malfunctions, such as when a human accidentally types an extra digit

31

Bad Labels

This is when data is incorrectly labeled or categorized, such as when a human evaluator mislabels a picture of an oak tree as a maple

32

Qualities of Good Numerical Features

Clearly named

Checked or Tested before Training

Sensible

33

Clearly Named

Each feature should have a clear, sensible, and obvious meaning to any human on the project

34

Checked or Tested before Training

Features should be checked for appropriate values and outliers

35

Sensible

Avoid magic values (purposeful discontinuities in continuous features)

Use separate Boolean indicator features instead

36

Polynomial Transform

A technique where a new, synthetic feature is created by raising an existing numerical feature to a power (e.g., squaring, cubing). For example, creating x₂ = x₁².
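
A minimal sketch of a polynomial transform in pandas, creating the synthetic squared feature from the example above:

import pandas as pd

df = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0]})  # hypothetical feature
df["x2"] = df["x1"] ** 2  # synthetic feature: x2 = x1 squared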

37

Use of Polynomial Transform

They allow linear models to capture non-linear relationships in the data. This is useful when the relationship between features and the target variable isn't linear, or when data points from different classes cannot be separated by a straight line but might be separable by a curve.

38

Categorical data

This data has a specific set of possible values such as:

Different Species

Names of streets in a particular city

Whether email is spam

Colors that house exteriors are painted

39

Encoding

This means converting categorical or other data to numerical vectors that a model can train on

40

Dimension

Synonym for the number of elements in a feature vector

41

Vocabulary

When a categorical feature has a low number of possible categories, you can encode it as _____

42

Index Numbers

ML models can only manipulate floating-point numbers, therefore, you must convert each string to a unique ____ _____

43

One-hot Encoding

The next step in building a vocabulary is to convert each index number to its _____

Each category is represented by a vector of N elements, where N is the number of categories

Exactly one of the elements in a one-hot vector has the value 1.0; all the remaining elements have the value of 0.0
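
A minimal sketch of one-hot encoding with pandas, assuming a hypothetical color feature:

import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})  # hypothetical categories
# Each category becomes an N-element vector with exactly one 1.0
one_hot = pd.get_dummies(df["color"], dtype=float)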

44

Multi-hot encoding

Multiple values can be 1.0

45

Sparse Representation

A feature whose values are predominantly zero (or empty)

This means storing the position of the 1.0 in a sparse vector

Consumes far less memory than a dense one-hot vector

46

Out-of-vocabulary Category (OOV)

A single outlier bucket into which you store the outlier categories

The system learns a single weight for that outlier bucket

47

Embeddings

These substantially reduce the number of dimensions, which benefits models by making them train faster and infer predictions more quickly

48

Feature Hashing

A less common way to reduce the number of dimensions

Used when the number of categorical feature values is very large

49

Human Raters

Humans who label data; their labels are called gold labels and are more desirable than machine-labeled data due to better data quality

Human-rated data is still subject to human error, bias, and malice

50

Inter-rater agreement

Any two human raters may label the same example differently; this measures the consistency between the raters’ decisions

51

Machine Raters

Categories are automatically determined by one or more classification models

Often referred to as silver labels

52

High dimensionality

Feature vectors having a large number of elements

This increases training costs and makes training more difficult

53

Feature crosses

Created by crossing two or more categorical or bucketed features of the dataset

Allow linear models to handle nonlinearities
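
A minimal sketch of a feature cross in pandas, assuming two hypothetical categorical columns; the crossed feature is then one-hot encoded:

import pandas as pd

df = pd.DataFrame({"edges": ["smooth", "toothed"], "points": ["none", "lobed"]})  # hypothetical columns
df["edges_x_points"] = df["edges"] + "_" + df["points"]  # cross the two categories
crossed_one_hot = pd.get_dummies(df["edges_x_points"], dtype=float)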

54

Polynomial Transforms vs Feature Crosses

The former combines numerical data, while the latter combines categorical data

55

Types of Data

Numerical Data

Categorical Data

Human Language

Multimedia

Outputs from other ML systems

Embedding vectors

56

Reliability of Data

Refers to the degree to which you can trust your data

A model trained on reliable data is more likely to yield useful predictions

57

Measuring reliability

How common are label errors?

Are your features noisy?

Is data properly filtered for your problem?

58

Unreliable Data Causes

Omitted Values

Duplicate Examples

Bad feature values

Bad labels

Bad Sections of Data

59

Imputation

The process of generating well-reasoned substitutes for missing data, not random or deceptive data.

Good _____ can improve your model, but bad ______ can hurt it
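
A minimal sketch of a simple imputation with pandas, filling omitted values with the column median (a hypothetical choice of statistic):

import pandas as pd

df = pd.DataFrame({"age": [34, None, 29, 41, None]})  # hypothetical column with omitted values
df["age"] = df["age"].fillna(df["age"].median())  # replace NaN with a well-reasoned value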

60

Direct Labels

Labels identical to the prediction your model is trying to make

The prediction your model is trying to make is present as an exact column in your dataset

Better than proxy labels

61

Proxy Labels

Labels that are similar—but not identical—to the prediction your model is trying to make

A compromise—an imperfect approximation of a direct label

62

Human-generated data

One or more humans examine some information and provide a value, usually for the label

63

Automatically-generated Data

Software determines the value of the data

64

Imbalanced Datasets

One label is more common than the other label

The predominant label is called the majority class while the less common is called the minority class

Such datasets don’t contain enough minority class examples to train a model properly

65

Mild Degree of Imbalance

20-40% of the dataset belonging to the minority class

66

Moderate Degree of Imbalance

1-20% of the dataset belonging to the minority class

67

Extreme Degree of Imbalance

<1% of the dataset belonging to the minority class

68

Downsampling

Means training on a disproportionately low subset of the majority class examples

Extract random examples from the dominant class

69

Upweighting

Means adding an example weight to the downsampled class equal to the factor by which you downsampled

Add weight to the downsampled examples
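
A minimal sketch of downsampling the majority class by a factor of 10 and upweighting the kept examples by the same factor; the DataFrame and its columns are hypothetical:

import pandas as pd

# Hypothetical imbalanced dataset: label 0 is the majority class
df = pd.DataFrame({"feature": range(100), "label": [0] * 90 + [1] * 10})

factor = 10
majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Downsampling: train on a disproportionately low subset of the majority class
downsampled = majority.sample(frac=1 / factor, random_state=42)
rebalanced = pd.concat([downsampled, minority])

# Upweighting: weight each downsampled example by the factor used to downsample
rebalanced["weight"] = rebalanced["label"].map({0: float(factor), 1: 1.0})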

70

Rebalancing Ratio

This answers the question: how much should you downsample and upweight to rebalance your dataset?

Determine it experimentally, just as you would tune any other hyperparameter

71

Testing a model

Doing this on different examples is stronger proof of your model’s fitness than testing on the same set of examples

72

Approach to Dividing the Dataset

Three Subsets:

Training Set

Validation Set

Test Set
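
A minimal sketch of an 80/10/10 split with pandas; the dataset, fractions, and random_state are hypothetical choices:

import pandas as pd

df = pd.DataFrame({"feature": range(100), "label": range(100)})  # hypothetical dataset

train = df.sample(frac=0.8, random_state=42)             # training set
holdout = df.drop(train.index)
validation = holdout.sample(frac=0.5, random_state=42)   # validation set
test = holdout.drop(validation.index)                    # test set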

73

Validation Set

This set performs the initial testing on the model as it is being trained

Used to evaluate results from the training set

74

Test Set

This set is used to double-check the model

75

Problems with Test Sets

Many of the examples in the test set are duplicates of examples in the training set

76

Criteria of a Good Test Set

Large enough to yield statistically significant testing results

Representative of the dataset as a whole

Representative of the real-world data that the model will encounter

Zero examples duplicated in the training set

77

What to do with Too much data

Sample the data when there is plenty

78

Generalization

Broadening the knowledge of the model to make it useful to real-world data

The opposite of overfitting

A model that makes good predictions on new data

79

Overfitting

Means creating a model that matches (memorizes) the training set so closely that the model fails to make correct predictions on new data

Performs well in the lab but is worthless in the real world

80

Underfit model

Doesn’t even make good predictions on the training data

Is like a product that doesn’t even do well in the lab

81

Detecting Overfitting

These can help you detect it:

Loss curves

Generalization curves

The loss on the training data decreases over time while the loss on the validation data increases

82

Overfitting Causes

The training set doesn’t adequately represent real-life data (nor does the validation set or test set)

The model is too complex

83

Generalization conditions

Examples must be independently and identically distributed; that is, examples can’t influence each other

The dataset is stationary, meaning it doesn’t change significantly over time

The dataset partitions have the same distribution: the examples in the training set are statistically similar to the examples in the validation set, the test set, and real-world data

84

Occam’s Razor

The preference for simplicity in philosophy

This suggests that simple models generalize better than complex ones

85

Regularization

One approach to keeping a model simple

Forcing the model to become simpler during training

86

L2 regularization

Large weights can have a huge impact

Weights close to zero have little impact

This encourages weights toward 0 but never pushes them all the way to zero, which means it can never remove features from the model

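A minimal sketch of how the L2 penalty enters the training loss; the weights, loss value, and lambda are hypothetical:

import numpy as np

weights = np.array([0.5, -1.2, 3.0])  # hypothetical model weights
lambda_ = 0.01                        # regularization rate (lambda)

data_loss = 0.42                      # hypothetical loss on the training examples
l2_penalty = np.sum(weights ** 2)     # L2 complexity: sum of squared weights
total_loss = data_loss + lambda_ * l2_penalty
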
87

Regularization Rate (lambda)

Model developers tune the overall impact of complexity on model training by multiplying its value by this scalar

88

High Regularization Rate

Strengthens the influence of regularization, thereby reducing the chances of overfitting

Tends to produce a histogram of model weights having the following characteristics:

  • a normal distribution

  • a mean weight of 0

89

Low Regularization Rate

Lowers the influence of regularization, thereby increasing the chances of overfitting

Tends to produce a histogram of model weights with a flat distribution

90

Removing Regularization

Setting the regularization rate to zero

Training focuses exclusively on minimizing loss, which poses the highest possible overfitting risk

91

Early stopping

A regularization method that doesn’t involve a calculation of complexity

Ending training before the model fully converges

Although it usually increases training loss, it can decrease test loss
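
A minimal sketch of early stopping logic; train_one_epoch and validation_loss are hypothetical stand-ins for a real training loop:

import random

def train_one_epoch():
    pass  # hypothetical stand-in for one training pass

def validation_loss():
    return random.random()  # hypothetical stand-in for evaluating on the validation set

best_val_loss = float("inf")
patience, bad_epochs = 3, 0  # stop after 3 epochs without improvement

for epoch in range(100):
    train_one_epoch()
    val_loss = validation_loss()
    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # end training before the model fully converges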