Supervised Learning- Input
(x^(1), y^(1)), …, (x^(m), y^(m)) where x^(i) is the i-th data point and y^(i) is the i-th label associated with this data point.
Supervised Learning- Want
find a function f such that f(x^(i)) is a good approximation to y^(i)
Supervised Learning- Goal
Use the function f to predict labels for unseen data points x
Supervised Learning- X
attributes, predictor variables, or independent variables
Supervised Learning- Y
classes, target variable, or dependent variable
concept to be learned
hypothesis space
set of functions f:X→Y
Classification
Learning how to separate/differentiate two or more classes (discrete or categorical target).
Learning the separating boundary
Classification Examples
Fraud/Not-Fraud, Spam/Not-Spam, Sentiment Analysis, Image Classification (fruits, cars, etc.)
Regression
Learning how to predict a continuous numerical target
Regression Examples
predicting temperature or house/stock prices
Linear Regression- Input
(x^(1), y^(1)), …, (x^(m), y^(m)) where x^(i) ∈ ℝ and y^(i) ∈ ℝ.
Linear Regression- Hypothesis Space
set of linear functions f(x) = ax + b, where a ∈ ℝ and b ∈ ℝ
Linear Regression- Goal
find a function f such that f(X) ≈ Y, minimizing the error metric
Linear Regression- Error
computed based on the difference between predicted and actual values
optimal linear hypothesis
minimizes the sum of squared differences between predicted and actual values
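For reference, this least-squares objective can be written as follows (a standard formulation; the 1/m scaling is one common convention and is not stated on the card):

```latex
L(a, b) = \frac{1}{m} \sum_{i=1}^{m} \left( a\,x^{(i)} + b - y^{(i)} \right)^2
```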
Gradient Descent
iteratively update the parameters θ until L(θ) reaches its minimum value
Gradient
the derivative of L(θ) at a point; it points in the direction of steepest increase
Gradient Descent Convergence
Gradient descent converges for convex functions, but the choice of learning rate is crucial for convergence
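A minimal sketch (not from the cards) of batch gradient descent for the linear hypothesis f(x) = a·x + b under the squared-error loss above; the data, learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

def gradient_descent(x, y, lr=0.01, n_iters=1000):
    a, b = 0.0, 0.0                              # initial parameters
    m = len(x)
    for _ in range(n_iters):
        error = a * x + b - y                    # residuals f(x^(i)) - y^(i)
        grad_a = (2.0 / m) * np.sum(error * x)   # dL/da
        grad_b = (2.0 / m) * np.sum(error)       # dL/db
        a -= lr * grad_a                         # step against the gradient
        b -= lr * grad_b
    return a, b

# Illustrative usage: noisy points around y = 3x + 1
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3 * x + 1 + rng.normal(0, 0.5, 100)
print(gradient_descent(x, y))                    # roughly (3, 1) for a suitable learning rate
```

If the learning rate is too large the updates overshoot and diverge; too small and convergence becomes very slow, which is the point made on the card.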
Non-convex functions
can get stuck in a local minimum
evaluate the learned hypothesis on
test data, not training data, to detect overfitting rather than reward it
Classification Input
(x^(1), y^(1)), …, (x^(m), y^(m)) where x^(i) ∈ ℝ and y^(i) ∈ ℤ
Classification
Want to learn a function f to predict a categorical or discrete target
What is decision tree?
It is a representation of concepts and hypotheses based on a set of Boolean/if-then rules. The Boolean conditions determine the outcomes.
Example of Decision Tree
(Outlook=“Sunny”) AND (Humidity=“Normal”) → Yes (you should play tennis), but
(Outlook=“Sunny”) AND (Humidity=“High”) → No (you shouldn’t play tennis)
internal (non-leaf) node
tests the value of a particular feature (Outlook, Humidity, Wind)
leaf node
specifies a class label / outcome (in this case whether or not you should play tennis)
DTs classify instances..
by sorting them down the tree from the root to some leaf node, which provides the classification of the instance
Each node in the tree...
specifies a test of some attribute of the instance
Each branch descending from that node
corresponds to one of the possible values for this attribute
Classification process
An instance is classified by starting at the root of the tree, testing the attribute specified by this node, then moving down the tree branch corresponding to the value of the attribute in the given example. This process is then repeated for the subtree rooted at the new node
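A minimal sketch of this root-to-leaf classification process; the tree structure and attribute values are illustrative, based on the play-tennis example above:

```python
# A hypothetical play-tennis tree as nested dicts: internal nodes test an
# attribute, branches are attribute values, leaves are class labels.
tree = {
    "Outlook": {
        "Sunny":    {"Humidity": {"Normal": "Yes", "High": "No"}},
        "Overcast": "Yes",
        "Rain":     {"Wind": {"Weak": "Yes", "Strong": "No"}},
    }
}

def classify(node, instance):
    # Leaves are plain labels; internal nodes are {attribute: {value: subtree}}.
    while isinstance(node, dict):
        attribute, branches = next(iter(node.items()))   # attribute tested at this node
        node = branches[instance[attribute]]             # follow the matching branch
    return node

print(classify(tree, {"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"}))  # -> "No"
```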
Induction Process
Inferring general rules from specific data
Deductive Process
Using general rules to reach specific conclusions
Decision trees divide
the feature space into axis-parallel rectangles
DT Algorithm
Recursive and Greedy Algorithm
Basic DT Algorithm
(i) Select an attribute
(ii) Split the data based on the values of this attribute
(iii) Recurse over each new partition
How to decide which attribute?
Informally, we want the split that gives maximum purity at each node, i.e., a split such that all (or the largest part of) the instances belong to a single class
Entropy
Entropy measures the degree of disorder, impurity or randomness in the system.
Equivalently, entropy quantifies how homogeneous a decision tree node is: low entropy means a pure (homogeneous) node, high entropy means a mixed node.
Entropy is maximized for uniform distributions
Entropy is minimized for distributions that place all their probability on a single outcome
for a binary random variable
Entropy is maximum at p=0.5, Entropy is zero at p=0 or p=1
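For the binary case, the standard entropy formula (written here for reference, not taken verbatim from the card) is

```latex
H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)
```

which equals 1 bit at p = 0.5 and 0 at p = 0 or p = 1 (using the convention 0·log 0 = 0).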
Conditional Entropy
Conditional Entropy is the amount of information needed to quantify the random variable Y given the random variable X. In practice, it tells you the uncertainty/impurity about a variable Y when another variable X is known
Information Gain
Informally, IG measures how good an attribute is for splitting a heterogeneous group into homogeneous groups. Larger information gain corresponds to less uncertainty about Y given X
select an attribute..
with largest IG
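A minimal sketch of these quantities (the data layout, with parallel lists of attribute values and class labels, is an illustrative assumption):

```python
# Entropy, conditional entropy and information gain for an attribute X and labels Y.
from collections import Counter
from math import log2

def entropy(labels):
    m = len(labels)
    return -sum((c / m) * log2(c / m) for c in Counter(labels).values())

def information_gain(xs, ys):
    # IG(Y, X) = H(Y) - H(Y|X), where H(Y|X) = sum_x p(X = x) * H(Y | X = x)
    m = len(ys)
    by_value = {}
    for x, y in zip(xs, ys):
        by_value.setdefault(x, []).append(y)
    cond_entropy = sum(len(group) / m * entropy(group) for group in by_value.values())
    return entropy(ys) - cond_entropy

# Illustrative usage with a tiny play-tennis-style sample:
outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain"]
play    = ["No",    "No",    "Yes",      "Yes",  "No"]
print(information_gain(outlook, play))
```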
Where to stop?
When the node is pure
When you run out of attributes to recurse on
When there are no more data points in the node (it is empty)
The class corresponding to a leaf is determined by majority vote
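A minimal recursive sketch of the greedy algorithm with these stopping criteria and the majority vote; the data layout and helper names are illustrative assumptions, and it reuses the information_gain helper from the previous sketch:

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, attributes, parent_majority=None):
    # rows: list of dicts {attribute: value}; labels: parallel list of class labels.
    if not rows:                        # empty node: fall back to the parent's majority class
        return parent_majority
    if len(set(labels)) == 1:           # pure node
        return labels[0]
    if not attributes:                  # ran out of attributes: majority vote
        return majority(labels)
    # Greedy step: pick the attribute with the largest information gain.
    best = max(attributes, key=lambda a: information_gain([r[a] for r in rows], labels))
    node = {best: {}}
    remaining = [a for a in attributes if a != best]
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        node[best][value] = build_tree([rows[i] for i in idx],
                                       [labels[i] for i in idx],
                                       remaining,
                                       majority(labels))
    return node
```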
Discretization
Form an ordinal categorical attribute from a real-valued one, e.g.:
Equal interval bucketing
Equal frequency bucketing (percentiles)
Binary Decision (x_k < t or x_k ≥ t)
consider all possible splits and find the best threshold t
More computationally intensive, but you do not need to test every threshold: it is enough to consider splits between consecutive data points with different class labels (splits within a run of the same class cannot be optimal)
The learning algorithm maximizes over all attributes and all possible thresholds of the real-valued attributes
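A minimal sketch of the threshold search for one real-valued attribute; candidate thresholds are taken as midpoints between consecutive sorted values at class boundaries, and information_gain is the helper sketched earlier (an illustrative assumption):

```python
def best_threshold(values, labels):
    # Sort by the real-valued attribute and try midpoints between consecutive
    # points whose class labels differ.
    pairs = sorted(zip(values, labels))
    best_t, best_gain = None, -1.0
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 == y2 or v1 == v2:
            continue                               # only class boundaries yield candidates
        t = (v1 + v2) / 2.0
        binary = [v < t for v, _ in pairs]         # the binary decision x_k < t
        gain = information_gain(binary, [y for _, y in pairs])
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Illustrative usage:
print(best_threshold([64, 65, 68, 69, 70, 71], ["Yes", "No", "Yes", "Yes", "Yes", "No"]))
```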
Handling training examples with missing attribute values
Assign the most common values in the overall data
Assign the most common value based on other examples with same target value
Overfitting
On small datasets some splits may perfectly fit to the training data, but they won’t generalize well on unseen/unknown datasets
Sources of Overfitting
Noise, Small number of examples in leaves
DT model should generalize well
it should work well on examples which you have not seen before
Train / Test Split
Use training part to learn the model.
Use test to evaluate performance and validate the model.
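For example, a common way to do this in Python (assuming scikit-learn is available; X, y, and the 20% split are placeholder choices):

```python
# Hold out 20% of the data for testing; fit on the training part only.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # placeholder feature matrix (10 examples, 2 features)
y = np.array([0, 1] * 5)           # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# model.fit(X_train, y_train)    -> learn the model on the training split
# model.score(X_test, y_test)    -> evaluate/validate on the held-out test split
```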
Pre-pruning or Early Stopping
Stop the algorithm (recursion) before it generates a fully grown tree, or before it reaches a base case (pure or empty node)
Criteria to early stopping
Minimum number of samples in a parent node (before splitting)
Minimum number of samples on leaf nodes
Maximum depth of a tree
Minimum IG
Post Pruning
Grow full tree, then remove branches or nodes to improve the model’s ability to generalize
Reduced error pruning
remove branches that do not affect the performance on test/validation data
greedily remove the branch whose removal improves validation-set accuracy the most
Minimum impurity decrease
remove nodes if the decrease in impurity is lower than a given threshold
Handling attributes with high cardinality (many values)
penalize the attribute A by incorporating a term called “Split Information”
Split Information
sensitive to how broadly and uniformly the attribute splits the data
where p(X = x) corresponds to the proportion of the data for which attribute X takes the value x.
Gain Ratio
use it in place of IG to penalize attributes with many values
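In the notation of the cards above, the standard formulation (as used in C4.5) is

```latex
\mathrm{SplitInformation}(X) = -\sum_{x} p(X = x)\,\log_2 p(X = x),
\qquad
\mathrm{GainRatio}(Y, X) = \frac{IG(Y, X)}{\mathrm{SplitInformation}(X)}
```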
binary tree properties
A complete binary tree of height h has 2^(h+1) − 1 nodes
Number of nodes at depth d is 2^d
Decision Tree Properties
Due to the complexity inherent in decision tree structures, they usually overfit (pruning is always a good idea)
Simple models that explain the data are usually preferred over more complicated ones
Finding the smallest tree that explains the concept to be learned as well as a larger one is known to be an NP-hard problem
Greedy heuristic: pick the best attribute (highest IG) at each stage
Common DT methods
ID3 (Iterative Dichotomiser 3)
CART (Classification And Regression Trees)
C4.5 (improvements over ID3)
CHAID (Chi-squared Automatic Interaction Detection)